

# Getting started with SageMaker HyperPod
<a name="smcluster-getting-started-slurm"></a>

Get started with creating your first SageMaker HyperPod cluster and learn the cluster operation functionalities of SageMaker HyperPod. You can create a SageMaker HyperPod cluster through the SageMaker AI console UI or the AWS CLI commands. This tutorial shows how to create a new SageMaker HyperPod cluster with Slurm, a popular workload manager. After you go through this tutorial, you will know how to log in to the cluster nodes using AWS Systems Manager commands (`aws ssm`). After you complete this tutorial, see also [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md) to learn more about the basic SageMaker HyperPod operations, and [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md) to learn how to schedule jobs on the provisioned cluster.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md)
+ [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md)
+ [Getting started with SageMaker HyperPod using the AWS CLI](smcluster-getting-started-slurm-cli.md)

# Getting started with SageMaker HyperPod using the SageMaker AI console
<a name="smcluster-getting-started-slurm-console"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the SageMaker AI console UI. Following the tutorial, you'll create a HyperPod cluster with three instance groups: `my-controller-group`, `my-login-group`, and `worker-group-1`.

**Topics**
+ [Create cluster](#smcluster-getting-started-slurm-console-create-cluster-page)
+ [Deploy resources](#smcluster-getting-started-slurm-console-create-cluster-deploy)
+ [Delete the cluster and clean resources](#smcluster-getting-started-slurm-console-delete-cluster-and-clean)

## Create cluster
<a name="smcluster-getting-started-slurm-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose **Slurm** orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Slurm**.

1. On the Slurm cluster creation page, you will see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI creates new resources such as a VPC, subnets, security groups, an Amazon S3 bucket, an IAM role, and an FSx for Lustre file system in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-slurm-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group for Controller and Compute group types.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.
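As a quick way to confirm the additional EBS volume after the cluster is up, you can parse `df -h` output for the `/opt/sagemaker` mount point. The following sketch runs against a hypothetical `df -h` capture; device names and sizes on your nodes will differ.

```python
# Parse example `df -h` output to confirm the additional EBS volume is
# mounted at /opt/sagemaker. The sample text below is illustrative only;
# run `df -h` on a cluster node to get real values.
sample = """Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1  100G   20G   80G  20% /
/dev/nvme1n1    500G  1.0G  499G   1% /opt/sagemaker"""

def find_mount(df_output: str, path: str):
    """Return the device and size of the filesystem mounted at `path`."""
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields[-1] == path:
            return {"device": fields[0], "size": fields[1]}
    return None

print(find_mount(sample, "/opt/sagemaker"))  # {'device': '/dev/nvme1n1', 'size': '500G'}
```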

### Quick setup defaults
<a name="smcluster-getting-started-slurm-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-slurm-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**.

### Networking
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-network"></a>

Configure your network settings for the cluster creation. These settings can't be changed after the cluster is created.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the IPv4 CIDR block for your VPC (for example, `10.0.0.0/16`).

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security groups**, create a security group or choose up to five security groups configured with rules to allow inter-resource communication within the VPC.
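For a rough sense of how a VPC CIDR block divides into per-AZ subnets, the following sketch uses Python's standard `ipaddress` module. The `10.0.0.0/16` block, `/20` subnet size, and AZ names are arbitrary examples, not HyperPod defaults.

```python
import ipaddress

# Split an example VPC CIDR block into /20 subnets, one per Availability Zone.
# The CIDR, prefix length, and AZ names are illustrative values.
vpc = ipaddress.ip_network("10.0.0.0/16")
azs = ["us-west-2a", "us-west-2b", "us-west-2c"]
subnets = list(vpc.subnets(new_prefix=20))[:len(azs)]

for az, subnet in zip(azs, subnets):
    print(az, subnet)
# us-west-2a 10.0.0.0/20
# us-west-2b 10.0.16.0/20
# us-west-2c 10.0.32.0/20
```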

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.

### Lifecycle scripts
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-lifecycle"></a>

You can choose to use the default lifecycle scripts or custom lifecycle scripts, which are stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.

### Permissions
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf.

### Storage
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose whether to use an existing FSx for Lustre file system, create a new one, or not provision an FSx for Lustre file system.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, view the value that's recommended for the new file systems.

### Tags - optional
<a name="smcluster-getting-started-slurm-console-create-cluster-tags"></a>

For **Tags - *optional***, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy"></a>

After you complete the cluster configurations using either **Quick setup** or **Custom setup**, choose one of the following options to start resource provisioning and cluster creation.
+ **Submit** - SageMaker AI starts provisioning the configured resources and creating the cluster.
+ **Download CloudFormation template parameters** - You download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md).

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-console-delete-cluster-and-clean"></a>

After you have successfully tested creating a SageMaker HyperPod cluster, it continues running in the `InService` state until you delete the cluster. We recommend that you delete any clusters created using on-demand SageMaker AI instances when not in use to avoid incurring continued service charges based on on-demand pricing. In this tutorial, you created a cluster that consists of three instance groups running on-demand instances, so make sure you delete the cluster by following the instructions at [Delete a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster).

However, if you have created a cluster with reserved compute capacity, the status of the clusters does not affect service billing.

To clean up the lifecycle scripts from the S3 bucket used for this tutorial, go to the S3 bucket you used during cluster creation and remove the files entirely.

If you tested running any workloads on the cluster, check whether you uploaded any data, or whether your jobs saved any artifacts, to other S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all such artifacts and data from the storage or file system.

# Creating SageMaker HyperPod clusters using CloudFormation templates
<a name="smcluster-getting-started-slurm-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install AWS CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-console)
+ [Configure resources and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-console"></a>

You can configure resources using the AWS Management Console and deploy using the CloudFormation templates. 

Follow these steps.

1. *Instead of choosing **Submit***, choose **Download CloudFormation template parameters** at the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md). The tutorial contains important configuration information you will need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the AWS CLI** window will appear on the right side of the page.

1. On the **Using the configuration file to create the cluster using the AWS CLI** window, choose **Download configuration parameters file**. The file will be downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is, if no change is required.

1. In the terminal, navigate to the directory that contains the downloaded parameter file, which the following command references as `file://params.json`.

1. Run the [create-stack](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the [CloudFormation console](https://console.aws.amazon.com/cloudformation).

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

## Configure resources and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-cfn"></a>

You can configure resources and deploy using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url URL_of_the_file_that_contains_the_template_body \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the CloudFormation console.

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

# Getting started with SageMaker HyperPod using the AWS CLI
<a name="smcluster-getting-started-slurm-cli"></a>

Create your first SageMaker HyperPod cluster using the AWS CLI commands for HyperPod.

## Create your first SageMaker HyperPod cluster with Slurm
<a name="smcluster-getting-started-slurm-cli-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-cli). Following the tutorial, you'll create a HyperPod cluster with three instance groups: `my-controller-group`, `my-login-group`, and `worker-group-1`.

With the API-driven configuration approach, you define Slurm node types and partition assignments directly in the CreateCluster API request using `SlurmConfig`. This eliminates the need for a separate `provisioning_parameters.json` file and provides built-in validation, drift detection, and per-instance-group FSx configuration.

1. First, prepare and upload lifecycle scripts to an Amazon S3 bucket. During cluster creation, HyperPod runs them in each instance group. Upload lifecycle scripts to Amazon S3 using the following command.

   ```
   aws s3 sync \
       ~/local-dir-to-lifecycle-scripts/ \
       s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
   ```
**Note**  
The S3 bucket path should start with the prefix `sagemaker-`, because the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) with `AmazonSageMakerClusterInstanceRolePolicy` only allows access to Amazon S3 buckets whose names start with that prefix.

   If you are starting from scratch, use sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following sub-steps show how to download and upload the sample lifecycle scripts to an Amazon S3 bucket.

   1. Download a copy of the lifecycle script samples to a directory on your local computer.

      ```
      git clone https://github.com/aws-samples/awsome-distributed-training/
      ```

   1. Go into the directory [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config), where you can find a set of lifecycle scripts.

      ```
      cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
      ```

      To learn more about the lifecycle script samples, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

   1. Upload the scripts to `s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src`. You can do so by using the Amazon S3 console, or by running the following AWS CLI Amazon S3 command.

      ```
      aws s3 sync \
          ~/local-dir-to-lifecycle-scripts/ \
          s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
      ```
**Note**  
With API-driven configuration, you do not need to create or upload a `provisioning_parameters.json` file. The Slurm configuration is defined directly in the CreateCluster API request in the next step.
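Because the managed policy restricts access by bucket name, a quick client-side check of the S3 URI before uploading can catch a misnamed bucket early. This helper is illustrative, not part of any SageMaker tooling.

```python
# Check that a lifecycle-script S3 URI uses a bucket whose name begins with
# "sagemaker-", as required by AmazonSageMakerClusterInstanceRolePolicy.
# Illustrative helper only.
def valid_lifecycle_bucket(s3_uri: str) -> bool:
    if not s3_uri.startswith("s3://"):
        return False
    bucket = s3_uri[len("s3://"):].split("/", 1)[0]
    return bucket.startswith("sagemaker-")

print(valid_lifecycle_bucket("s3://sagemaker-my-bucket/lifecycle/src"))  # True
print(valid_lifecycle_bucket("s3://my-bucket/lifecycle/src"))            # False
```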

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) request file in JSON format and save it as `create_cluster.json`.

   With API-driven configuration, you specify the Slurm node type and partition assignment for each instance group using the `SlurmConfig` field. You also configure the cluster-level Slurm settings using `Orchestrator.Slurm`.

   For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole",
               "InstanceStorageConfigs": [
                   {
                       "EbsVolumeConfig": {
                           "VolumeSizeInGB": 500
                       }
                   }
               ]
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       }
   }
   ```
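Before calling the API, you can sanity-check the request file locally. The sketch below applies two rules implied by this tutorial (every instance group names a valid Slurm node type, and there is exactly one Controller group); it is illustrative and not an exhaustive validation of the CreateCluster schema.

```python
# Illustrative local check of a CreateCluster request's instance groups.
def check_slurm_groups(request: dict) -> list:
    """Return a list of problems found in the instance group definitions."""
    problems = []
    node_types = []
    for group in request.get("InstanceGroups", []):
        node_type = group.get("SlurmConfig", {}).get("NodeType")
        if node_type not in ("Controller", "Login", "Compute"):
            problems.append(f"{group.get('InstanceGroupName')}: bad NodeType {node_type!r}")
        node_types.append(node_type)
    if node_types.count("Controller") != 1:
        problems.append("expected exactly one Controller instance group")
    return problems

# Trimmed-down version of the request shown above.
request = {
    "ClusterName": "my-hyperpod-cluster",
    "InstanceGroups": [
        {"InstanceGroupName": "my-controller-group", "SlurmConfig": {"NodeType": "Controller"}},
        {"InstanceGroupName": "my-login-group", "SlurmConfig": {"NodeType": "Login"}},
        {"InstanceGroupName": "worker-group-1", "SlurmConfig": {"NodeType": "Compute"}},
    ],
}
print(check_slurm_groups(request))  # []
```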

   **SlurmConfig fields:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

   **Orchestrator.Slurm fields:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

   **SlurmConfigStrategy options:**
   + `Managed` (recommended): HyperPod fully manages `slurm.conf` and detects unauthorized changes (drift detection). Updates fail if drift is detected.
   + `Overwrite`: HyperPod overwrites `slurm.conf` on updates, ignoring any manual changes.
   + `Merge`: HyperPod preserves manual changes and merges them with API configuration.

   **Adding FSx for Lustre (optional):**

   To mount an FSx for Lustre filesystem to your compute nodes, add `FsxLustreConfig` to the `InstanceStorageConfigs` for the instance group. This requires a Custom VPC configuration.

   ```
   {
       "InstanceGroupName": "worker-group-1",
       "InstanceType": "ml.trn1.32xlarge",
       "InstanceCount": 1,
       "SlurmConfig": {
           "NodeType": "Compute",
           "PartitionNames": ["partition-1"]
       },
       "InstanceStorageConfigs": [
           {
               "FsxLustreConfig": {
                "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                   "MountPath": "/fsx",
                   "MountName": "abcdefgh"
               }
           }
       ],
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
   }
   ```
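For orientation, the `DnsName`, `MountName`, and `MountPath` values map onto the standard FSx for Lustre client mount command. The sketch below only builds the command string (mount options such as `relatime` and `flock` are omitted); it is an illustration of how the fields relate, not the exact command HyperPod executes.

```python
# Build the standard FSx for Lustre client mount command from the
# FsxLustreConfig fields. Illustrative only; mount options omitted.
def lustre_mount_command(dns_name: str, mount_name: str, mount_path: str) -> str:
    return f"sudo mount -t lustre {dns_name}@tcp:/{mount_name} {mount_path}"

print(lustre_mount_command(
    "fs-0abc123def456789.fsx.us-west-2.amazonaws.com", "abcdefgh", "/fsx"))
# sudo mount -t lustre fs-0abc123def456789.fsx.us-west-2.amazonaws.com@tcp:/abcdefgh /fsx
```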

   **Adding FSx for OpenZFS (optional):**

   You can also mount FSx for OpenZFS filesystems:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
            "DnsName": "fs-0xyz789abc123456.fsx.us-west-2.amazonaws.com",
               "MountPath": "/shared"
           }
       }
   ]
   ```
**Note**  
Each instance group can have at most one FSx for Lustre and one FSx for OpenZFS configuration. Different instance groups can mount different filesystems.
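The at-most-one rule in the note above can be checked locally; this illustrative helper counts FSx entries per instance group.

```python
# Enforce the rule: per instance group, at most one FsxLustreConfig and at
# most one FsxOpenZfsConfig entry. Illustrative helper, not SageMaker tooling.
def storage_configs_valid(instance_group: dict) -> bool:
    configs = instance_group.get("InstanceStorageConfigs", [])
    lustre = sum("FsxLustreConfig" in c for c in configs)
    openzfs = sum("FsxOpenZfsConfig" in c for c in configs)
    return lustre <= 1 and openzfs <= 1

# One Lustre and one OpenZFS mount in the same group is allowed.
group = {"InstanceStorageConfigs": [
    {"FsxLustreConfig": {"MountPath": "/fsx"}},
    {"FsxOpenZfsConfig": {"MountPath": "/shared"}},
]}
print(storage_configs_valid(group))  # True
```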

   **Adding VPC configuration (required for FSx):**

   If using FSx, you must specify a Custom VPC configuration:

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
            }
        ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "VpcConfig": {
           "SecurityGroupIds": ["sg-0abc123def456789a"],
           "Subnets": ["subnet-0abc123def456789a"]
       }
   }
   ```
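A similar local check can enforce the FSx-requires-VPC rule described above: if any instance group mounts an FSx file system, the request must carry a cluster-level `VpcConfig`. This helper is illustrative only.

```python
# Illustrative check: any FSx storage config in any instance group
# requires a cluster-level VpcConfig in the CreateCluster request.
def fsx_has_required_vpc(request: dict) -> bool:
    uses_fsx = any(
        "FsxLustreConfig" in cfg or "FsxOpenZfsConfig" in cfg
        for group in request.get("InstanceGroups", [])
        for cfg in group.get("InstanceStorageConfigs", [])
    )
    return (not uses_fsx) or "VpcConfig" in request

request = {
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceStorageConfigs": [{"FsxLustreConfig": {"MountPath": "/fsx"}}],
    }],
    "VpcConfig": {"SecurityGroupIds": ["sg-0abc123def456789a"],
                  "Subnets": ["subnet-0abc123def456789a"]},
}
print(fsx_has_required_vpc(request))  # True
```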

1. Run the following command to create the cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the created cluster.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"
   }
   ```
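If you script cluster creation, the returned ARN can be split into its region, account, and cluster-name components (format `arn:aws:sagemaker:<region>:<account>:cluster/<name>`):

```python
# Split a SageMaker HyperPod cluster ARN into its components.
def parse_cluster_arn(arn: str) -> dict:
    _arn, _partition, _service, region, account, resource = arn.split(":", 5)
    _resource_type, name = resource.split("/", 1)
    return {"region": region, "account": account, "name": name}

print(parse_cluster_arn(
    "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"))
# {'region': 'us-west-2', 'account': '111122223333', 'name': 'my-hyperpod-cluster'}
```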

   If you receive an error due to resource limits, ensure that you change the instance type to one with sufficient quotas in your account, or request additional quotas by following [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   **Common validation errors:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

1. Run `describe-cluster` to check the status of the cluster.

   ```
   aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster",
       "ClusterName": "my-hyperpod-cluster",
       "ClusterStatus": "Creating",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "CreationTime": "2024-01-15T10:30:00Z"
   }
   ```

   After the cluster status changes to **InService**, proceed to the next step. Cluster creation typically takes 10 to 15 minutes.
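   Instead of re-running `describe-cluster` by hand, you can poll the status in a short shell loop. This is a sketch assuming the cluster name used in this tutorial.

   ```shell
   # Poll the cluster status every 60 seconds until it leaves the Creating state.
   while true; do
       STATUS=$(aws sagemaker describe-cluster \
           --cluster-name my-hyperpod-cluster \
           --query ClusterStatus \
           --output text)
       echo "Cluster status: ${STATUS}"
       if [ "${STATUS}" != "Creating" ]; then
           break
       fi
       sleep 60
   done
   ```

   If the final status is `Failed` rather than `InService`, check the `FailureMessage` field in the `describe-cluster` response.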

1. Run `list-cluster-nodes` to check the details of the cluster nodes.

   ```
   aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterNodeSummaries": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceId": "i-0abc123def456789a",
               "InstanceType": "ml.c5.xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceId": "i-0abc123def456789b",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceId": "i-0abc123def456789c",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:36:00Z"
           }
       ]
   }
   ```

   The `InstanceId` values are what your cluster users need to log in to the nodes through AWS Systems Manager (`aws ssm`). For more information about logging in to the cluster nodes and running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
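   If you only need the instance ID of a particular node, such as the login node, a JMESPath `--query` filter avoids copying it out of the full JSON response by hand. This sketch assumes the instance group names used in this tutorial.

   ```shell
   # Print the instance ID of the node in the login instance group.
   aws sagemaker list-cluster-nodes \
       --cluster-name my-hyperpod-cluster \
       --query "ClusterNodeSummaries[?InstanceGroupName=='my-login-group'].InstanceId" \
       --output text
   ```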

1. Connect to your cluster using AWS Systems Manager Session Manager.

   ```
   aws ssm start-session \
       --target sagemaker-cluster:my-hyperpod-cluster_my-login-group-i-0abc123def456789b \
       --region us-west-2
   ```

   Once connected, verify that Slurm is configured correctly:

   ```
   # Check Slurm nodes
   sinfo
   
   # Check Slurm partitions
   sinfo -p partition-1
   
   # Submit a test job
   srun -p partition-1 --nodes=1 hostname
   ```
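   Beyond the interactive `srun` check, you can also confirm that batch submission works. The following is a minimal example batch script; the job name, file name, and output pattern are illustrative.

   ```shell
   #!/bin/bash
   # hello.sbatch - minimal test job for the partition created in this tutorial
   #SBATCH --job-name=hello
   #SBATCH --partition=partition-1
   #SBATCH --nodes=1
   #SBATCH --output=hello_%j.out

   # Print the hostname of the allocated compute node.
   srun hostname
   ```

   Submit it with `sbatch hello.sbatch` and monitor it with `squeue`; the output lands in `hello_<job-id>.out` in the submission directory.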

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-cli-delete-cluster-and-clean"></a>

After you have successfully tested creating a SageMaker HyperPod cluster, it continues running in the `InService` state until you delete it. To avoid incurring continued charges at on-demand pricing, we recommend that you delete any clusters created using on-demand SageMaker AI capacity when they are not in use. In this tutorial, you created a cluster that consists of three instance groups. Delete the cluster by running the following command.

```
aws sagemaker delete-cluster --cluster-name my-hyperpod-cluster
```
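Deletion is asynchronous: the cluster passes through a `Deleting` state before it disappears. As a sketch, you can confirm that deletion has completed by polling until `describe-cluster` returns an error, which indicates the cluster no longer exists.

```shell
# Poll until describe-cluster fails, which means the cluster has been deleted.
while aws sagemaker describe-cluster \
        --cluster-name my-hyperpod-cluster \
        --query ClusterStatus \
        --output text 2>/dev/null; do
    sleep 30
done
echo "Cluster deleted."
```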

To clean up the lifecycle scripts from the Amazon S3 bucket used for this tutorial, go to the Amazon S3 bucket you used during cluster creation and remove the files entirely.

```
aws s3 rm s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src --recursive
```

If you have tested running any model training workloads on the cluster, also check if you have uploaded any data or if your job has saved any artifacts to different Amazon S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.

## Related topics
<a name="smcluster-getting-started-slurm-cli-related-topics"></a>
+ [SageMaker HyperPod Slurm configuration](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-configuration)
+ [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [FSx configuration via InstanceStorageConfigs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-fsx-config)
+ [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)