

# Support NVIDIA-IMEX with the P6e-GB200 instance
<a name="support-nvidia-imex-p6e-gb200-instance"></a>

This tutorial explains how to get started with P6e-GB200 on AWS ParallelCluster to take advantage of the highest GPU performance for AI training and inference. [p6e-gb200.36xlarge instances are available only through P6e-GB200 UltraServers](https://aws.amazon.com/ec2/instance-types/p6/), where `u-p6e-gb200x72` is the UltraServer size and `p6e-gb200.36xlarge` is the [InstanceType](Scheduling-v3.md#yaml-Scheduling-SlurmQueues-ComputeResources-InstanceType) of the instances that form the UltraServer. When you purchase a `u-p6e-gb200x72` UltraServer, it is made available through an [EC2 Capacity Block for ML](https://aws.amazon.com/ec2/capacityblocks/) containing 18 `p6e-gb200.36xlarge` instances. To learn more, see [P6e-GB200](https://aws.amazon.com/ec2/instance-types/p6/).

AWS ParallelCluster version 3.14.0:
+ Provides the full NVIDIA software stack (driver, CUDA, EFA, NVIDIA-IMEX) required by this instance type
+ Creates the nvidia-imex configuration for P6e-GB200 UltraServers
+ Enables and starts the `nvidia-imex` service on P6e-GB200 UltraServers
+ Configures the Slurm block topology plugin so that each P6e-GB200 UltraServer (EC2 Capacity Block) is a Slurm block of the correct size (see the 3.14.0 entries in the [release notes and document history](document_history.md)); an example check is shown after this list.
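
On a running cluster you can, for example, inspect the block topology that ParallelCluster configured by querying Slurm from the head node (a quick check, assuming the default `/opt/slurm` installation path):

```
# Show the topology Slurm was configured with; each P6e-GB200 UltraServer
# should appear as one block (run on the head node of a running cluster)
/opt/slurm/bin/scontrol show topology
```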

However, GPU-to-GPU communication over NVLink requires additional configuration. In particular, the [IMEX nodes configuration file](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html#imex-service-node-configuration-file-location), which contains the IP addresses of the compute nodes in the IMEX domain, is not generated automatically by ParallelCluster. To help generate this file, we provide a Prolog script that automatically discovers the compute node IPs and writes the [nodes configuration file](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html#imex-service-node-configuration-file-location), following the [NVIDIA IMEX Slurm job scheduler integration](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html#job-scheduler-integration) recommendations. This tutorial walks through how to create the Prolog script, deploy it through a HeadNode custom action, and validate the IMEX setup.
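
For reference, the nodes configuration file expected by IMEX (`/etc/nvidia-imex/nodes_config.cfg` in this setup) is simply the list of the IP addresses of the compute nodes in the domain, one per line, which is exactly what the prolog script below writes. A hypothetical example for a three-node domain (addresses are placeholders):

```
172.31.48.81
172.31.48.98
172.31.48.221
```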

**Note**  
P6e-GB200 is supported starting from AWS ParallelCluster v3.14.0 on Amazon Linux 2023, Ubuntu 22.04, and Ubuntu 24.04. For detailed software versions and the up-to-date list of supported distributions, see the [AWS ParallelCluster CHANGELOG](https://github.com/aws/aws-parallelcluster/blob/develop/CHANGELOG.md).

## Create the Prolog script to manage NVIDIA-IMEX
<a name="support-nvidia-imex-p6e-gb200-instance-prolog"></a>

**Limitations:**
+ This prolog script runs only when a job is submitted with exclusive node access. This ensures that restarting IMEX does not disrupt any jobs already running on p6e-gb200 nodes that belong to the IMEX domain. An example exclusive submission is shown after this list.
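
For example, exclusive node access can be requested at submission time (`my_training_job.sh` is a placeholder for your own job script):

```
# Request exclusive access to the allocated nodes so the prolog can safely
# reconfigure and restart IMEX before the job starts
sbatch --exclusive my_training_job.sh
```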

The following `91_nvidia_imex_prolog.sh` script is what you should configure as a Prolog in Slurm. It automatically updates the nvidia-imex configuration on the compute nodes. The script name is prefixed with `91` to follow [SchedMD's naming convention](https://slurm.schedmd.com/prolog_epilog.html), which ensures that it runs before any other prolog script in the sequence. At job start, the script rewrites the NVIDIA IMEX nodes configuration and reloads the required NVIDIA daemons.

**Note**  
This script is not executed when multiple jobs start simultaneously on the same node, so we recommend using the `--exclusive` flag at submission time.

```
#!/usr/bin/env bash

# This prolog script configures the NVIDIA IMEX on compute nodes involved in the job execution.
#
# In particular:
# - Checks whether the job is executed exclusively.
#   If not, it exits immediately because it requires jobs to be executed exclusively.
# - Checks if it is running on a p6e-gb200 instance type.
#   If not, it exits immediately because IMEX must be configured only on that instance type.
# - Checks if the IMEX service is enabled.
#   If not, it exits immediately because IMEX must be enabled to get configured.
# - Creates the IMEX default channel.
#   For more information about IMEX channels, see https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html
# - Writes the private IP addresses of compute nodes into /etc/nvidia-imex/nodes_config.cfg.
# - Restarts the IMEX system service.
#
# REQUIREMENTS:
#  - This prolog assumes to be run only with exclusive jobs.

LOG_FILE_PATH="/var/log/parallelcluster/nvidia-imex-prolog.log"
SCONTROL_CMD="/opt/slurm/bin/scontrol"
IMEX_START_TIMEOUT=60
IMEX_STOP_TIMEOUT=15
ALLOWED_INSTANCE_TYPES="^(p6e-gb200)"
IMEX_SERVICE="nvidia-imex"
IMEX_NODES_CONFIG="/etc/nvidia-imex/nodes_config.cfg"

function info() {
  echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [INFO] [PID:$$] [JOB:${SLURM_JOB_ID}] $1"
}

function warn() {
  echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [WARN] [PID:$$] [JOB:${SLURM_JOB_ID}] $1"
}

function error() {
  echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [ERROR] [PID:$$] [JOB:${SLURM_JOB_ID}] $1"
}

function error_exit() {
  error "$1" && exit 1
}

function prolog_end() {
    info "PROLOG End JobId=${SLURM_JOB_ID}: $0"
    info "----------------"
    exit 0
}

function get_instance_type() {
  local token=$(curl -X PUT -s "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
  curl -s -H "X-aws-ec2-metadata-token: ${token}" http://169.254.169.254/latest/meta-data/instance-type
}

function return_if_unsupported_instance_type() {
  local instance_type=$(get_instance_type)

  if [[ ! ${instance_type} =~ ${ALLOWED_INSTANCE_TYPES} ]]; then
    info "Skipping IMEX configuration because instance type ${instance_type} does not support it"
    prolog_end
  fi
}

function return_if_imex_disabled() {
  if ! systemctl is-enabled "${IMEX_SERVICE}" &>/dev/null; then
    warn "Skipping IMEX configuration because system service ${IMEX_SERVICE} is not enabled"
    prolog_end
  fi
}

function return_if_job_is_not_exclusive() {
  if [[ "${SLURM_JOB_OVERSUBSCRIBE}" =~ ^(NO|TOPO)$  ]]; then
    info "Job is exclusive, proceeding with IMEX configuration"
  else
    info "Skipping IMEX configuration because the job is not exclusive"
    prolog_end
  fi
}

function get_ips_from_node_names() {
  local _nodes=$1
  ${SCONTROL_CMD} -ao show node "${_nodes}" | sed 's/^.* NodeAddr=\([^ ]*\).*/\1/'
}

function get_compute_resource_name() {
  local _queue_name_prefix=$1
  local _slurmd_node_name=$2
  echo "${_slurmd_node_name}" | sed -E "s/${_queue_name_prefix}(.+)-[0-9]+$/\1/"
}

function reload_imex() {
  info "Stopping IMEX"
  timeout ${IMEX_STOP_TIMEOUT} systemctl stop ${IMEX_SERVICE}
  pkill -9 ${IMEX_SERVICE}

  info "Restarting IMEX"
  timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
}

function create_default_imex_channel() {
  info "Creating IMEX default channel"
  MAJOR_NUMBER=$(cat /proc/devices | grep nvidia-caps-imex-channels | cut -d' ' -f1)
  if [ ! -d "/dev/nvidia-caps-imex-channels" ]; then
    sudo mkdir /dev/nvidia-caps-imex-channels
  fi

  # Then check and create device node
  if [ ! -e "/dev/nvidia-caps-imex-channels/channel0" ]; then
    sudo mknod /dev/nvidia-caps-imex-channels/channel0 c $MAJOR_NUMBER 0
    info "IMEX default channel created"
  else
    info "IMEX default channel already exists"
  fi
}

{
  info "PROLOG Start JobId=${SLURM_JOB_ID}: $0"

  return_if_job_is_not_exclusive
  return_if_unsupported_instance_type
  return_if_imex_disabled

  create_default_imex_channel

  IPS_FROM_CR=$(get_ips_from_node_names "${SLURM_NODELIST}")

  info "Node Names: ${SLURM_NODELIST}"
  info "Node IPs: ${IPS_FROM_CR}"
  info "IMEX Nodes Config: ${IMEX_NODES_CONFIG}"

  info "Updating IMEX nodes config ${IMEX_NODES_CONFIG}"
  echo "${IPS_FROM_CR}" > "${IMEX_NODES_CONFIG}"
  reload_imex

  prolog_end

} 2>&1 | tee -a "${LOG_FILE_PATH}" | logger -t "91_nvidia_imex_prolog"
```
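
The script writes its own log both to a dedicated file and to syslog (see the `tee`/`logger` pipeline at the end). To troubleshoot it, you can for example follow the dedicated log file on a compute node:

```
# Follow the prolog log on a compute node that is part of the IMEX domain
tail -f /var/log/parallelcluster/nvidia-imex-prolog.log

# Or query syslog for the same entries (the tag is set by the logger command)
journalctl -t 91_nvidia_imex_prolog
```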

## Create the HeadNode OnNodeStart custom action script
<a name="support-nvidia-imex-p6e-gb200-instance-action-script"></a>

Create an `install_custom_action.sh` custom action that downloads the prolog script above into `/opt/slurm/etc/scripts/prolog.d/`, a shared directory accessible from the compute nodes, and sets the permissions it needs to run.

```
#!/bin/bash
set -e

echo "Executing $0"

PROLOG_NVIDIA_IMEX=/opt/slurm/etc/scripts/prolog.d/91_nvidia_imex_prolog.sh
aws s3 cp "s3://<Bucket>/91_nvidia_imex_prolog.sh" "${PROLOG_NVIDIA_IMEX}"
chmod 0755 "${PROLOG_NVIDIA_IMEX}"
```
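
Before creating the cluster, upload both scripts to the S3 bucket that the HeadNode custom action and `install_custom_action.sh` read from (replace `<s3-bucket-name>` with your bucket name and keep it consistent with the `<Bucket>` placeholder used in the script above):

```
# Upload the prolog and the custom action script to the bucket referenced
# by the cluster configuration in the next section
aws s3 cp 91_nvidia_imex_prolog.sh s3://<s3-bucket-name>/91_nvidia_imex_prolog.sh
aws s3 cp install_custom_action.sh s3://<s3-bucket-name>/install_custom_action.sh
```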

## Create the cluster
<a name="support-nvidia-imex-p6e-gb200-instance-cluster"></a>

Create a cluster that includes P6e-GB200 instances. Below is an example configuration with a SlurmQueues entry for the `u-p6e-gb200x72` UltraServer type.

P6e-GB200 is currently available only in Local Zones. Some [Local Zones do not support NAT gateways](https://docs.aws.amazon.com/local-zones/latest/ug/local-zones-connectivity-nat.html), so review the [connectivity options for Local Zones](https://docs.aws.amazon.com/local-zones/latest/ug/local-zones-connectivity.html), because ParallelCluster requires connectivity to AWS services (see [Configure security groups for a restricted environment](security-groups-configuration.md)). Also follow [Launch instances with Capacity Blocks (CB) in AWS ParallelCluster](launch-instances-capacity-blocks.md), because UltraServers are only available through Capacity Blocks.

```
HeadNode:
  CustomActions:
    OnNodeStart:
      Script: s3://<s3-bucket-name>/install_custom_action.sh
    S3Access:
      - BucketName: <s3-bucket-name>
  InstanceType: <HeadNode-instance-type>
  Networking:
    SubnetId: <subnet-abcd78901234567890>
  Ssh:
    KeyName: <Key-name>
Image:
  Os: ubuntu2404
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      - PrologFlags: "Alloc,NoHold"
      - MessageTimeout: 240
  SlurmQueues:
    - CapacityReservationTarget:
        CapacityReservationId: <cr-123456789012345678>
      CapacityType: CAPACITY_BLOCK
      ComputeResources: ### u-p6e-gb200x72
        - DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
          InstanceType: p6e-gb200.36xlarge  
          MaxCount: 18
          MinCount: 18
          Name: cr1
      Name: q1
      Networking:
        SubnetIds:
          - <subnet-1234567890123456>
```
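
Once the configuration is saved to a file (for example `cluster-config.yaml`, an arbitrary name), the cluster can be created with the ParallelCluster CLI:

```
# Create the cluster from the configuration above (the cluster name is arbitrary)
pcluster create-cluster --cluster-name p6e-gb200-cluster --cluster-configuration cluster-config.yaml

# Check the creation status until it reaches CREATE_COMPLETE
pcluster describe-cluster --cluster-name p6e-gb200-cluster
```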

## Validate the IMEX setup
<a name="support-nvidia-imex-p6e-gb200-instance-validate"></a>

When you submit a Slurm job, the `91_nvidia_imex_prolog.sh` prolog runs. The following is an example job that checks the status of the NVIDIA-IMEX domain.

```
#!/bin/bash
#SBATCH --job-name=nvidia-imex-status-job
#SBATCH --ntasks-per-node=1
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

QUEUE_NAME="q1"
COMPUTE_RES_NAME="cr1"
IMEX_CONFIG_FILE="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RES_NAME}.cfg"

srun bash -c "/usr/bin/nvidia-imex-ctl -N -c ${IMEX_CONFIG_FILE} > result_\${SLURM_JOB_ID}_\$(hostname).out 2> result_\${SLURM_JOB_ID}_\$(hostname).err"
```
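
The job can be submitted with exclusive node access so that the prolog reconfigures IMEX on every node of the UltraServer; a minimal sketch (the script file name and the node count are assumptions for this example):

```
# Submit the status-check job exclusively on all 18 nodes of the UltraServer
sbatch --exclusive --nodes=18 --partition=q1 nvidia-imex-status-job.sh

# Each node writes its own result file; inspect them after the job completes
# (replace <job-id> with the job ID returned by sbatch)
cat result_<job-id>_*.out
```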

Check the output of the job:

```
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation

5/12/2025 06:08:10.580
Nodes:
Node #0   - 172.31.48.81    - READY                - Version: 570.172
Node #1   - 172.31.48.98    - READY                - Version: 570.172
Node #2   - 172.31.48.221   - READY                - Version: 570.172
Node #3   - 172.31.49.228   - READY                - Version: 570.172
Node #4   - 172.31.50.39    - READY                - Version: 570.172
Node #5   - 172.31.50.44    - READY                - Version: 570.172
Node #6   - 172.31.51.66    - READY                - Version: 570.172
Node #7   - 172.31.51.157   - READY                - Version: 570.172
Node #8   - 172.31.52.239   - READY                - Version: 570.172
Node #9   - 172.31.53.80    - READY                - Version: 570.172
Node #10  - 172.31.54.95    - READY                - Version: 570.172
Node #11  - 172.31.54.183   - READY                - Version: 570.172
Node #12  - 172.31.54.203   - READY                - Version: 570.172
Node #13  - 172.31.54.241   - READY                - Version: 570.172
Node #14  - 172.31.55.59    - READY                - Version: 570.172
Node #15  - 172.31.55.187   - READY                - Version: 570.172
Node #16  - 172.31.55.197   - READY                - Version: 570.172
Node #17  - 172.31.56.47    - READY                - Version: 570.172

 Nodes From\To  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
       0        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       1        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       2        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
       3        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
       4        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       5        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       6        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
       7        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       8        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       9        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      10        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      11        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      12        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      13        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      14        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
      15        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
      16        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
      17        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  

Domain State: UP
```