

# Jobs on SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-run-jobs-slurm"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters. Depending on how you have set up the environment on your HyperPod cluster, there are many ways to run ML workloads. You can also find examples of running ML workloads on HyperPod clusters in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following topics walk you through how to log in to your provisioned HyperPod clusters and get started with running sample ML workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md)
+ [Scheduling a Slurm job on a SageMaker HyperPod cluster](sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job.md)
+ [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md)
+ [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md)

# Accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes"></a>

You can access your **InService** cluster through AWS Systems Manager (SSM) by running the AWS CLI command `aws ssm start-session` with the SageMaker HyperPod cluster host name in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```
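The target name used in the preceding `start-session` command can be assembled from the three values you retrieve with `describe-cluster` and `list-cluster-nodes`. The following is a minimal sketch using the placeholder values from the example above:

```shell
# Placeholder values from the example above; substitute the values
# returned by describe-cluster and list-cluster-nodes for your cluster.
CLUSTER_ID="aa11bbbbb222"
INSTANCE_GROUP="controller-group"
INSTANCE_ID="i-111222333444555aa"

# Assemble the SSM target in the sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id] format.
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "$TARGET"
```

You can then pass `$TARGET` to `aws ssm start-session --target "$TARGET"`.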

For advanced settings for practical use of HyperPod clusters, see the following topics.

**Topics**
+ [Additional tips for accessing your SageMaker HyperPod cluster nodes](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips)
+ [Set up a multi-user environment through the Amazon FSx shared space](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space)
+ [Set up a multi-user environment by integrating HyperPod clusters with Active Directory](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory)

## Additional tips for accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips"></a>

**Use the `easy-ssh.sh` script provided by HyperPod for simplifying the connection process**

To condense the previous process into a single command, the HyperPod team provides the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh) script, which retrieves your cluster information, assembles it into the SSM command, and connects to the compute node. You don't need to look up the required HyperPod cluster information manually, because the script runs the `describe-cluster` and `list-cluster-nodes` commands and parses the information needed to complete the SSM command. The following example commands show how to run the `easy-ssh.sh` script. If it runs successfully, you're connected to the cluster as the root user. The script also prints a code snippet for setting up SSH by adding the HyperPod cluster as a remote host through an SSM proxy. By setting up SSH, you can connect your local development environment, such as Visual Studio Code, to the HyperPod cluster.

```
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c <node-group> <cluster-name>
Cluster id: <cluster_id>
Instance id: <instance_id>
Node Group: <node-group>
Add the following to your ~/.ssh/config to easily connect:

$ cat <<EOF >> ~/.ssh/config
Host <cluster-name>
  User ubuntu
  ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
EOF

Add your ssh keypair and then you can do:

$ ssh <cluster-name>

aws ssm start-session --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id>

Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```

**Set up for easy access with SSH by using the HyperPod compute node as a remote host**

To further simplify access to the compute node using SSH from a local machine, the `easy-ssh.sh` script outputs a code snippet for setting up the HyperPod cluster as a remote host, as shown in the previous section. The snippet is auto-generated so that you can add it directly to the `~/.ssh/config` file on your local device. The following procedure shows how to set up easy access using SSH through the SSM proxy, so that you or your cluster users can run `ssh <cluster-name>` to connect directly to the HyperPod cluster node.

1. On your local device, add the HyperPod compute node with a user name as a remote host to the `~/.ssh/config` file. The following command shows how to append the auto-generated code snippet from the `easy-ssh.sh` script to the `~/.ssh/config` file. Make sure that you copy the snippet from the auto-generated output of the `easy-ssh.sh` script, which contains the correct cluster information.

   ```
   $ cat <<EOF >> ~/.ssh/config
   Host <cluster-name>
     User ubuntu
     ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
   EOF
   ```

1. On the HyperPod cluster node, add the public key from your local device to the `~/.ssh/authorized_keys` file.

   1. Print the public key file on your local machine.

      ```
      $ cat ~/.ssh/id_rsa.pub
      ```

      This should return your key. Copy the output of this command. 

      (Optional) If you don't have a public key, create one by running the following command.

      ```
      $ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
      ```

   1. Connect to the cluster node and switch to the user for whom you want to add the key. The following command is an example of accessing the node as the `ubuntu` user. Replace `ubuntu` with the user name for which you want to set up easy access with SSH.

      ```
      $ ./easy-ssh.sh -c <node-group> <cluster-name>
      $ sudo su - ubuntu
      ubuntu@ip-111-22-333-444:/usr/bin#
      ```

   1. Open the `~/.ssh/authorized_keys` file and add the public key at the end of the file.

      ```
      ubuntu@ip-111-22-333-444:/usr/bin# vim ~/.ssh/authorized_keys
      ```

After you finish setting up, you can connect to the HyperPod cluster node as the user by running a simplified SSH command as follows.

```
$ ssh <cluster-name>
ubuntu@ip-111-22-333-444:/usr/bin#
```

Also, you can use the host for remote development from an IDE on your local device, such as [Visual Studio Code Remote - SSH](https://code.visualstudio.com/docs/remote/ssh).

## Set up a multi-user environment through the Amazon FSx shared space
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space"></a>

You can use the Amazon FSx shared space to manage a multi-user environment in a Slurm cluster on SageMaker HyperPod. If you configured your Slurm cluster with Amazon FSx during HyperPod cluster creation, this is a good option for setting up workspaces for your cluster users. Create a new user and set up the user's home directory on the Amazon FSx shared file system.

**Tip**  
To allow users to access your cluster through their user name and dedicated directories, you should also associate them with IAM roles or users by tagging them as guided in **Option 2** of step 5 under the procedure **To turn on Run As support for Linux and macOS managed nodes** provided at [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html) in the AWS Systems Manager User Guide. See also [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

**To set up a multi-user environment while creating a Slurm cluster on SageMaker HyperPod**

The SageMaker HyperPod service team provides a script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) as part of the base lifecycle script samples. 

1. Prepare a text file named `shared_users.txt` in the following format. The first column is for user names, the second column is for unique user IDs, and the third column is for the user directories in the Amazon FSx shared space.

   ```
   username1,uid1,/fsx/username1
   username2,uid2,/fsx/username2
   ...
   ```

1. Make sure that you upload the `shared_users.txt` and [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) files to the S3 bucket for your HyperPod lifecycle scripts. While a cluster creation, cluster update, or cluster software update is in progress, the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) script reads `shared_users.txt` and sets up the user directories.
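To illustrate the `shared_users.txt` format, the following is a minimal sketch of how entries of the form `username,uid,home-directory` can be parsed. This is an illustration with placeholder entries, not the actual `add_users.sh` script:

```shell
# Placeholder entries in the username,uid,home-directory format.
cat > shared_users.txt << 'EOF'
username1,2001,/fsx/username1
username2,2002,/fsx/username2
EOF

# Read each comma-separated entry and print the corresponding useradd command.
CMDS=$(while IFS=, read -r name uid home; do
  [ -z "$name" ] && continue
  echo "useradd -u $uid -d $home -m -s /bin/bash $name"
done < shared_users.txt)
echo "$CMDS"
```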

**To create new users and add them to an existing Slurm cluster running on SageMaker HyperPod**

1. On the head node, run the following command to save a script that helps create a user. Make sure that you run this with sudo permissions.

   ```
   $ cat > create-user.sh << EOL
   #!/bin/bash
   
   set -x
   
   # Prompt user to get the new user name.
   read -p "Enter the new user name, i.e. 'sean': 
   " USER
   
   # create home directory as /fsx/<user>
   # Create the new user on the head node
   sudo useradd \$USER -m -d /fsx/\$USER --shell /bin/bash;
   user_id=\$(id -u \$USER)
   
   # add user to docker group
   sudo usermod -aG docker \${USER}
   
   # setup SSH Keypair
   sudo -u \$USER ssh-keygen -t rsa -q -f "/fsx/\$USER/.ssh/id_rsa" -N ""
   sudo -u \$USER cat /fsx/\$USER/.ssh/id_rsa.pub | sudo -u \$USER tee /fsx/\$USER/.ssh/authorized_keys
   
   # add user to compute nodes
   read -p "Number of compute nodes in your cluster, i.e. 8: 
   " NUM_NODES
   srun -N \$NUM_NODES sudo useradd -u \$user_id \$USER -d /fsx/\$USER --shell /bin/bash;
   
   # add them as a sudoer
   read -p "Do you want this user to be a sudoer? (y/N):
   " SUDO
   if [ "\$SUDO" = "y" ]; then
           sudo usermod -aG sudo \$USER
           sudo srun -N \$NUM_NODES sudo usermod -aG sudo \$USER
           echo -e "If you haven't already you'll need to run:\n\nsudo visudo /etc/sudoers\n\nChange the line:\n\n%sudo   ALL=(ALL:ALL) ALL\n\nTo\n\n%sudo   ALL=(ALL:ALL) NOPASSWD: ALL\n\nOn each node."
   fi
   EOL
   ```

1. Run the script with the following command. You'll be prompted to enter a user name and the number of compute nodes that you want the user to access.

   ```
   $ bash create-user.sh
   ```

1. Test the user by running the following commands. 

   ```
   $ sudo su - <user> && ssh $(srun hostname)
   ```

1. Add the user information to the `shared_users.txt` file, so the user will be created on any new compute nodes or new clusters.
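For example, appending an entry with placeholder values in the same `username,uid,home-directory` format:

```shell
# Append the new user's entry (placeholder values) to shared_users.txt.
echo "sean,2001,/fsx/sean" >> shared_users.txt
```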

## Set up a multi-user environment by integrating HyperPod clusters with Active Directory
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory"></a>

In practical use cases, HyperPod clusters are typically used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. Each user edits their own files and runs their own jobs without impacting the others' work. One way to set up a multi-user environment is to use the Linux user and group mechanism to statically create multiple users on each instance through lifecycle scripts. The drawback of this approach is that, to keep the configuration consistent across all instances in the cluster, you need to duplicate the user and group settings on every instance whenever you make updates such as adding, editing, or removing users.

To solve this, you can use [Lightweight Directory Access Protocol (LDAP)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) and [LDAP over TLS/SSL (LDAPS)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) to integrate with a directory service such as [AWS Directory Service for Microsoft Active Directory](https://aws.amazon.com/directoryservice/). To learn more about setting up Active Directory and a multi-user environment in a HyperPod cluster, see the blog post [Integrate HyperPod clusters with Active Directory for seamless multi-user login](https://aws.amazon.com/blogs/machine-learning/integrate-hyperpod-clusters-with-active-directory-for-seamless-multi-user-login/).

# Scheduling a Slurm job on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job"></a>

You can launch training jobs using the standard Slurm `sbatch` or `srun` commands. For example, to launch an 8-node training job, you can run `srun -N 8 --exclusive train.sh`. SageMaker HyperPod supports training in a range of environments, including `conda`, `venv`, `docker`, and `enroot`. You can configure an ML environment by running lifecycle scripts on your SageMaker HyperPod clusters. You also have the option to attach a shared file system such as Amazon FSx, which can also be used as a virtual environment.
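The 8-node `srun` example above can equivalently be submitted as a batch script with `sbatch`. The following is a minimal sketch, where `train.sh` and `train.sbatch` are placeholder names for your launcher script and batch file:

```shell
# Write a minimal batch script equivalent to: srun -N 8 --exclusive train.sh
# Submit it with: sbatch train.sbatch
cat > train.sbatch << 'EOF'
#!/bin/bash
#SBATCH --nodes=8        # number of nodes, same as srun -N 8
#SBATCH --exclusive      # exclusive use of the allocated nodes
#SBATCH --job-name=train
srun train.sh
EOF
```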

The following example shows how to run a job for training Llama-2 with the Fully Sharded Data Parallelism (FSDP) technique on a SageMaker HyperPod cluster with an Amazon FSx shared file system. You can also find more examples from the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

**Tip**  
All SageMaker HyperPod examples are available in the `3.test_cases` folder of the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

1. Clone the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/), and copy the training job examples to your Amazon FSx file system. 

   ```
   $ TRAINING_DIR=/fsx/users/my-user/fsdp
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   ```

1. Use the [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/0.create_conda_env.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/0.create_conda_env.sh) script to create a `conda` environment on your Amazon FSx file system. Make sure that the file system is accessible to all nodes in the cluster.

1. Build the virtual `conda` environment by launching a single-node Slurm job as follows.

   ```
   $ srun -N 1 /path_to/create_conda_env.sh
   ```

1. After the environment is built, you can launch a training job by pointing to the environment path on the shared volume. You can launch both single-node and multi-node training jobs with the same setup. To launch a job, create a job launcher script (also called an entry point script) as follows.

   ```
   #!/usr/bin/env bash
   set -ex
   
   ENV_PATH=/fsx/users/my_user/pytorch_env
   TORCHRUN=$ENV_PATH/bin/torchrun
   TRAINING_SCRIPT=/fsx/users/my_user/pt_train.py
   
   WORLD_SIZE_JOB=$SLURM_NTASKS
   RANK_NODE=$SLURM_NODEID
   PROC_PER_NODE=8
   MASTER_ADDR=(`scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1`)
   MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
   
   DIST_ARGS="--nproc_per_node=$PROC_PER_NODE \
              --nnodes=$WORLD_SIZE_JOB \
              --node_rank=$RANK_NODE \
              --master_addr=$MASTER_ADDR \
              --master_port=$MASTER_PORT \
             "
             
   $TORCHRUN $DIST_ARGS $TRAINING_SCRIPT
   ```
**Tip**  
If you want to make your training job more resilient against hardware failures by using the auto-resume capability of SageMaker HyperPod, you need to properly set up the environment variable `MASTER_ADDR` in the entrypoint script. To learn more, see [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md).

   This tutorial assumes that this script is saved as `/fsx/users/my_user/train.sh`.

1. With this script in the shared volume at `/fsx/users/my_user/train.sh`, run the following `srun` command to schedule the Slurm job.

   ```
   $ cd /fsx/users/my_user/
   $ srun -N 8 train.sh
   ```
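The entry point script above derives a unique `MASTER_PORT` per job from the Slurm job ID. The following sketch shows the arithmetic with a placeholder job ID (`SLURM_JOBID` is set by Slurm at run time):

```shell
# Placeholder job ID; Slurm sets SLURM_JOBID automatically inside a job.
SLURM_JOBID=123456789

# 10000 plus the last four characters of the job ID yields a port in the
# unprivileged range that differs across concurrent jobs.
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo $MASTER_PORT
```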

# Running Docker containers on a Slurm compute node on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-docker"></a>

To run Docker containers with Slurm on SageMaker HyperPod, you need to use [Enroot](https://github.com/NVIDIA/enroot) and [Pyxis](https://github.com/NVIDIA/pyxis). The Enroot package converts Docker images into a runtime that Slurm can understand, while Pyxis enables scheduling the runtime as a Slurm job through an `srun` command such as `srun --container-image=docker/image:tag`.

**Tip**  
The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts as guided in [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md). Use the [base lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) provided by the HyperPod service team when creating a HyperPod cluster. Those base scripts are set up to install the packages by default. In the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) script, the `Config` class has a Boolean parameter for installing the packages, which is set to `True` (`enable_docker_enroot_pyxis=True`). This parameter is parsed by the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script, which calls the `install_docker.sh` and `install_enroot_pyxis.sh` scripts from the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The installation scripts are where the actual installation of the packages takes place.  
Additionally, the installation scripts check whether local NVMe store paths are available on the instances they run on and, if so, set the root paths for Docker and Enroot to `/opt/dlami/nvme`. The default root volume of any fresh instance is mounted to `/tmp` with only a 100 GB EBS volume, which can run out of space if your workload involves training LLMs and therefore large Docker containers. If you use instance families with local NVMe storage, such as P and G, make sure that you use the NVMe storage attached at `/opt/dlami/nvme`; the installation scripts take care of the configuration.

**To check if the root paths are set up properly**

On a compute node of your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle script worked properly and that the root volume of each node is set to `/opt/dlami/nvme/*`. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.

```
$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH        /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times
```

```
$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times
```

After you confirm that the runtime paths are properly set to `/opt/dlami/nvme/*`, you're ready to build and run Docker containers with Enroot and Pyxis.

**To test Docker with Slurm**

1. On your compute node, try the following commands to check if Docker and Enroot are properly installed.

   ```
   $ docker --help
   $ enroot --help
   ```

1. Test if Pyxis and Enroot installed correctly by running one of the [NVIDIA CUDA Ubuntu](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda) images.

   ```
   $ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

   You can also test it by creating a script and running an `sbatch` command as follows.

   ```
   $ cat <<EOF >> container-test.sh
   #!/bin/bash
   #SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   nvidia-smi
   EOF
   
   $ sbatch container-test.sh
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

**To run a test Slurm job with Docker**

After you complete setting up Slurm with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following sample use case walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker AI model parallelism (SMP) library.

1. If you want to use one of the pre-built Amazon ECR images distributed by SageMaker AI or AWS Deep Learning Containers, make sure that you give your HyperPod cluster the permissions to pull ECR images through the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). If you use your own or an open source Docker image, you can skip this step. Otherwise, add the following permissions to the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). In this tutorial, we use the [SMP Docker image](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2) pre-packaged with the SMP library.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ecr:BatchCheckLayerAvailability",
                   "ecr:BatchGetImage",
                   "ecr-public:*",
                   "ecr:GetDownloadUrlForLayer",
                   "ecr:GetAuthorizationToken",
                   "sts:*"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1. On the compute node, clone the repository and go to the folder that provides the example scripts of training with SMP.

   ```
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   $ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
   ```

1. In this tutorial, run the sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh), which pulls the SMP Docker image, builds the Docker container, and imports it as an Enroot runtime. You can modify the script as needed.

   ```
   $ cat docker_build.sh
   #!/usr/bin/env bash
   
   region=us-west-2
   dlc_account_id=658645717510
   aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
   
   docker build -t smpv2 .
   enroot import -o smpv2.sqsh  dockerd://smpv2:latest
   ```

   ```
   $ bash docker_build.sh
   ```

1. Create a batch script to launch a training job using `sbatch`. In this tutorial, the provided sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh) launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts are provided at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts), and `launch_training_enroot.sh` takes `train_external.py` as the entrypoint script.
**Important**  
To use a Docker container on SageMaker HyperPod, you must mount the `/var/log` directory from the host machine, which is the HyperPod compute node in this case, onto the `/var/log` directory in the container. You can set this up by adding the following variable for Enroot.  

   ```
   "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
   ```

   ```
   $ cat launch_training_enroot.sh
   #!/bin/bash
   
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: MIT-0
   
   #SBATCH --nodes=8 # number of nodes to use, 8 x p4d(e) = 64 A100 GPUs
   #SBATCH --job-name=smpv2_llama # name of your job
   #SBATCH --exclusive # job has exclusive use of the resource, no sharing
   #SBATCH --wait-all-nodes=1
   
   set -ex;
   
   ###########################
   ###### User Variables #####
   ###########################
   
   #########################
   model_type=llama_v2
   model_size=70b
   
   # Toggle this to use synthetic data
   use_synthetic_data=1
   
   
   # To run training on your own data, set the training/test data paths below
   # to the tokenized dataset path in FSx. Acceptable formats are Hugging Face (arrow) and JSON Lines.
   # Also change use_synthetic_data to 0.
   
   export TRAINING_DIR=/fsx/path_to_data
   export TEST_DIR=/fsx/path_to_data
   export CHECKPOINT_DIR=$(pwd)/checkpoints
   
   # Variables for Enroot
   : "${IMAGE:=$(pwd)/smpv2.sqsh}"
    : "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This mount is needed to validate that the job runs on a HyperPod cluster
   : "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
   : "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
   : "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"   
   
   
   ###########################
   ## Environment Variables ##
   ###########################
   
   #export NCCL_SOCKET_IFNAME=en
   export NCCL_ASYNC_ERROR_HANDLING=1
   
   export NCCL_PROTO="simple"
   export NCCL_SOCKET_IFNAME="^lo,docker"
   export RDMAV_FORK_SAFE=1
   export FI_EFA_USE_DEVICE_RDMA=1
   export NCCL_DEBUG_SUBSYS=off
   export NCCL_DEBUG="INFO"
   export SM_NUM_GPUS=8
   export GPU_NUM_DEVICES=8
   export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
   
   # async runtime error ...
   export CUDA_DEVICE_MAX_CONNECTIONS=1
   
   
   #########################
   ## Command and Options ##
   #########################
   
   if [ "$model_size" == "7b" ]; then
       HIDDEN_WIDTH=4096
       NUM_LAYERS=32
       NUM_HEADS=32
       LLAMA_INTERMEDIATE_SIZE=11008
       DEFAULT_SHARD_DEGREE=8
   # More Llama model size options
   elif [ "$model_size" == "70b" ]; then
       HIDDEN_WIDTH=8192
       NUM_LAYERS=80
       NUM_HEADS=64
       LLAMA_INTERMEDIATE_SIZE=28672
       # Reduce for better perf on p4de
       DEFAULT_SHARD_DEGREE=64
   fi
   
   
   if [ -z "$shard_degree" ]; then
       SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
   else
       SHARD_DEGREE=$shard_degree
   fi
   
   if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
       LLAMA_ARGS=""
   else
       LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
   fi
   
   
   if [ $use_synthetic_data == 1 ]; then
       echo "using synthetic data"
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
       )
   else
       echo "using real data...."
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
       )
   fi
   
   
   declare -a TORCHRUN_ARGS=(
       # change this to match the number of gpus per node:
       --nproc_per_node=8 \
       --nnodes=$SLURM_JOB_NUM_NODES \
       --rdzv_id=$SLURM_JOB_ID \
       --rdzv_backend=c10d \
       --rdzv_endpoint=$(hostname) \
   )
   
   srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
               --train_batch_size 4 \
               --max_steps 100 \
               --hidden_width $HIDDEN_WIDTH \
               --num_layers $NUM_LAYERS \
               --num_heads $NUM_HEADS \
               ${LLAMA_ARGS} \
               --shard_degree $SHARD_DEGREE \
               --model_type $model_type \
               --profile_nsys 1 \
               --use_smp_implementation 1 \
               --max_context_width 4096 \
               --tensor_parallel_degree 1 \
               --use_synthetic_data $use_synthetic_data \
               --training_dir $TRAINING_DIR \
               --test_dir $TEST_DIR \
               --dataset_type hf \
               --checkpoint_dir $CHECKPOINT_DIR \
                --checkpoint_freq 100
   
   $ sbatch launch_training_enroot.sh
   ```
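The `: "${VAR:=value}"` lines in the script's *Variables for Enroot* section rely on Bash default assignment: `:` is a no-op command, so the line only evaluates the `${VAR:=value}` expansion, which assigns the default when the variable is unset or empty. This lets you override paths such as `IMAGE` from the environment at submission time without editing the script. A minimal sketch of the idiom, using the `IMAGE` variable name from the script above:

```shell
#!/bin/bash
# Default assignment: sets IMAGE only because it is unset here.
unset IMAGE
: "${IMAGE:=$HOME/smpv2.sqsh}"
echo "default: $IMAGE"

# An existing value survives the default assignment unchanged.
IMAGE=/fsx/custom.sqsh
: "${IMAGE:=$HOME/smpv2.sqsh}"
echo "override: $IMAGE"    # prints override: /fsx/custom.sqsh
```

Because of this, you can point the launch script at a different Enroot image with, for example, `IMAGE=/fsx/my-image.sqsh sbatch launch_training_enroot.sh`, without modifying the script itself.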

To find the downloadable code examples, see [Run a model-parallel training job using the SageMaker AI model parallelism library, Docker and Enroot with Slurm](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2#option-2----run-training-using-docker-and-enroot) in the *Awsome Distributed Training GitHub repository*. For more information about distributed training with a Slurm cluster on SageMaker HyperPod, proceed to the next topic at [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).

# Running distributed training workloads with Slurm on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload"></a>

SageMaker HyperPod is purpose-built for training workloads for large language models (LLMs) and foundation models (FMs). These workloads often require multiple parallelism techniques and operations optimized for ML infrastructure and resources. With SageMaker HyperPod, you can use the following SageMaker AI distributed training frameworks:
+ The [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md) that offers collective communication operations optimized for AWS.
+ The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) that implements various model parallelism techniques.

**Topics**
+ [Using SMDDP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp)
+ [Using SMP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp)

## Using SMDDP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp"></a>

The [SMDDP library](data-parallel.md) is a collective communication library that improves compute performance of distributed data parallel training. The SMDDP library works with the following open source distributed training frameworks:
+ [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html)
+ [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html)
+ [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)

The SMDDP library addresses the communication overhead of key collective communication operations by offering the following for SageMaker HyperPod.
+ The library offers `AllGather` optimized for AWS. `AllGather` is a key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries. These include the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).
+ The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and the SageMaker AI ML instance topology. 

**To run sample data-parallel training jobs**

Explore the following distributed training samples implementing data parallelism techniques using the SMDDP library.
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP)
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed)

**To set up an environment for using the SMDDP library on SageMaker HyperPod**

The following are training environment requirements for using the SMDDP library on SageMaker HyperPod.
+ PyTorch v2.0.1 and later
+ CUDA v11.8 and later
+ `libstdc++` runtime version greater than 3
+ Python v3.10.x and later
+ `ml.p4d.24xlarge` and `ml.p4de.24xlarge`, which are the instance types supported by the SMDDP library
+ IMDSv2 enabled on the training host
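Before installing, you can sanity-check "version X and later" requirements like the ones above with a small comparison helper. The following is an illustrative sketch, not part of HyperPod or SMDDP; the `version_ge` helper name is hypothetical, and the sketch assumes GNU `sort -V` is available on the node, as on most Linux distributions:

```shell
#!/bin/bash
# Hypothetical helper: version_ge A B succeeds when dotted version A >= B,
# using GNU sort -V to order version strings.
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: check an installed PyTorch version against the 2.0.1 minimum.
# In practice, obtain it with: python3 -c 'import torch; print(torch.__version__)'
torch_version="2.2.0"
if version_ge "$torch_version" "2.0.1"; then
    echo "PyTorch requirement met"
else
    echo "PyTorch too old for SMDDP"
fi
```

The same helper can be reused for the CUDA and Python minimums listed above by substituting the corresponding version strings.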

Depending on how you want to run the distributed training job, there are two options to install the SMDDP library:
+ A direct installation using the SMDDP binary file.
+ Using the SageMaker AI Deep Learning Containers (DLCs) pre-installed with the SMDDP library.

Docker images pre-installed with the SMDDP library or the URLs to the SMDDP binary files are listed at [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation.

**To install the SMDDP library on the SageMaker HyperPod DLAMI**
+ `pip install --no-cache-dir https://smdataparallel.s3.amazonaws.com/binary/pytorch/<pytorch-version>/cuXYZ/YYYY-MM-DD/smdistributed_dataparallel-X.Y.Z-cp310-cp310-linux_x86_64.whl`
**Note**  
If you work in a Conda environment, ensure that you install PyTorch using `conda install` instead of `pip`.  

  ```
   conda install pytorch==X.Y.Z torchvision==X.Y.Z torchaudio==X.Y.Z pytorch-cuda=X.Y.Z -c pytorch -c nvidia
  ```

**To use the SMDDP library on a Docker container**
+ The SMDDP library is pre-installed on the SageMaker AI Deep Learning Containers (DLCs). To find the list of SageMaker AI framework DLCs for PyTorch with the SMDDP library, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation. You can also bring your own Docker container with required dependencies installed to use the SMDDP library. To learn more about setting up a custom Docker container to use the SMDDP library, see also [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).
**Important**  
To use the SMDDP library in a Docker container, mount the `/var/log` directory from the host machine onto `/var/log` in the container. This can be done by adding the following option when running your container.  

  ```
  docker run <OTHER_OPTIONS> -v /var/log:/var/log ...
  ```

To learn how to run data-parallel training jobs with SMDDP in general, see [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md).

## Using SMP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp"></a>

The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) offers various [state-of-the-art model parallelism techniques](model-parallel-core-features-v2.md), including:
+ fully sharded data parallelism
+ expert parallelism
+ mixed precision training with FP16/BF16 and FP8 data types
+ tensor parallelism

The SMP library is also compatible with open source frameworks such as PyTorch FSDP, NVIDIA Megatron, and NVIDIA Transformer Engine.

**To run a sample model-parallel training workload**

The SageMaker AI service teams provide sample training jobs implementing model parallelism with the SMP library at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2).