


# Support for NVIDIA IMEX with p6e-gb200 instances
<a name="support-nvidia-imex-p6e-gb200-instance"></a>

This tutorial shows you how to get started with p6e-GB200 on AWS ParallelCluster to leverage the highest GPU performance for AI training and inference. The [p6e-gb200.36xlarge](https://aws.amazon.com/ec2/instance-types/p6/) instance is available only through p6e GB200 UltraServers, where `u-p6e-gb200x72` is the UltraServer size and `p6e-gb200.36xlarge` is the [InstanceType](Scheduling-v3.md#yaml-Scheduling-SlurmQueues-ComputeResources-InstanceType) composing the UltraServer. After the purchase of a `u-p6e-gb200x72` UltraServer, it will be available through an [EC2 Capacity Block for ML](https://aws.amazon.com/ec2/capacityblocks/) containing 18 `p6e-gb200.36xlarge` instances. To learn more, see [p6e-GB200](https://aws.amazon.com/ec2/instance-types/p6/).

Starting with version 3.14.0, AWS ParallelCluster:
+ Provides the full NVIDIA software stack required by this instance type (drivers, CUDA, EFA, NVIDIA-IMEX)
+ Creates the nvidia-imex configuration for p6e-GB200 UltraServers
+ Enables and starts the `nvidia-imex` service for p6e-GB200 UltraServers
+ Configures the Slurm Block topology plugin so that each p6e-GB200 UltraServer (EC2 Capacity Block) is a Slurm Block of the proper size (see the 3.14.0 entry in [Release notes and document history](document_history.md))

However, GPU-to-GPU communication over NVLink requires additional configuration, in particular the [IMEX nodes configuration file](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html#imex-service-node-configuration-file-location) containing the IP addresses of the compute nodes in the IMEX domain, which ParallelCluster does not generate automatically. To help generate this file, we provide a prolog script that automatically discovers the compute node IPs and configures the [nodes configuration file](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/config.html#imex-service-node-configuration-file-location) following the [NVIDIA IMEX recommendations for Slurm Job Scheduler integration](https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/deployment.html#job-scheduler-integration). This tutorial shows you how to create the prolog script, deploy it through a HeadNode custom action, and validate the IMEX setup.
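
For reference, the prolog presented below writes the private IP address of each compute node in the job allocation, one per line, into `/etc/nvidia-imex/nodes_config.cfg`. On a full `u-p6e-gb200x72` UltraServer the generated file therefore contains 18 lines, similar to the following (illustrative addresses, truncated to the first four nodes):

```
172.31.48.81
172.31.48.98
172.31.48.221
172.31.49.228
```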

**Note**  
p6e-GB200 is supported on Amazon Linux 2023, Ubuntu 22.04, and Ubuntu 24.04 starting from AWS ParallelCluster v3.14.0. For detailed software versions and an up-to-date list of supported distributions, see the [AWS ParallelCluster CHANGELOG](https://github.com/aws/aws-parallelcluster/blob/develop/CHANGELOG.md).

## Create the Prolog script to manage nvidia-imex
<a name="support-nvidia-imex-p6e-gb200-instance-prolog"></a>

**Limitations:**
+ This prolog script runs only when jobs are submitted exclusively. This ensures that restarting IMEX does not disrupt any jobs already running on P6e-GB200 nodes that belong to the IMEX domain.

The following is the `91_nvidia_imex_prolog.sh` script that you should configure as a prolog in Slurm. It automatically updates the nvidia-imex configuration on the compute nodes. The script name is prefixed with `91` to comply with [SchedMD's naming convention](https://slurm.schedmd.com/prolog_epilog.html), which ensures that it executes before any other prolog script in the sequence. At job start, the script reconfigures the NVIDIA IMEX nodes configuration and reloads the required NVIDIA daemons.

**Note**  
This script is not executed if multiple jobs start simultaneously on the same node, so we recommend using the `--exclusive` flag at submission time.

```
#!/usr/bin/env bash

# This prolog script configures the NVIDIA IMEX on compute nodes involved in the job execution.
#
# In particular:
# - Checks whether the job is executed exclusively.
#   If not, it exits immediately because it requires jobs to be executed exclusively.
# - Checks if it is running on a p6e-gb200 instance type.
#   If not, it exits immediately because IMEX must be configured only on that instance type.
# - Checks if the IMEX service is enabled.
#   If not, it exits immediately because IMEX must be enabled to get configured.
# - Creates the IMEX default channel.
#   For more information about IMEX channels, see https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html
# - Writes the private IP addresses of compute nodes into /etc/nvidia-imex/nodes_config.cfg.
# - Restarts the IMEX system service.
#
# REQUIREMENTS:
#  - This prolog assumes to be run only with exclusive jobs.

LOG_FILE_PATH="/var/log/parallelcluster/nvidia-imex-prolog.log"
SCONTROL_CMD="/opt/slurm/bin/scontrol"
IMEX_START_TIMEOUT=60
IMEX_STOP_TIMEOUT=15
ALLOWED_INSTANCE_TYPES="^(p6e-gb200)"
IMEX_SERVICE="nvidia-imex"
IMEX_NODES_CONFIG="/etc/nvidia-imex/nodes_config.cfg"

function info() {
  echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [INFO] [PID:$$] [JOB:${SLURM_JOB_ID}] $1"
}

function warn() {
  echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [WARN] [PID:$$] [JOB:${SLURM_JOB_ID}] $1"
}

function error() {
  echo "$(date "+%Y-%m-%dT%H:%M:%S.%3N") [ERROR] [PID:$$] [JOB:${SLURM_JOB_ID}] $1"
}

function error_exit() {
  error "$1" && exit 1
}

function prolog_end() {
    info "PROLOG End JobId=${SLURM_JOB_ID}: $0"
    info "----------------"
    exit 0
}

function get_instance_type() {
  local token=$(curl -X PUT -s "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
  curl -s -H "X-aws-ec2-metadata-token: ${token}" http://169.254.169.254/latest/meta-data/instance-type
}

function return_if_unsupported_instance_type() {
  local instance_type=$(get_instance_type)

  if [[ ! ${instance_type} =~ ${ALLOWED_INSTANCE_TYPES} ]]; then
    info "Skipping IMEX configuration because instance type ${instance_type} does not support it"
    prolog_end
  fi
}

function return_if_imex_disabled() {
  if ! systemctl is-enabled "${IMEX_SERVICE}" &>/dev/null; then
    warn "Skipping IMEX configuration because system service ${IMEX_SERVICE} is not enabled"
    prolog_end
  fi
}

function return_if_job_is_not_exclusive() {
  if [[ "${SLURM_JOB_OVERSUBSCRIBE}" =~ ^(NO|TOPO)$  ]]; then
    info "Job is exclusive, proceeding with IMEX configuration"
  else
    info "Skipping IMEX configuration because the job is not exclusive"
    prolog_end
  fi
}

function get_ips_from_node_names() {
  local _nodes=$1
  ${SCONTROL_CMD} -ao show node "${_nodes}" | sed 's/^.* NodeAddr=\([^ ]*\).*/\1/'
}

function get_compute_resource_name() {
  local _queue_name_prefix=$1
  local _slurmd_node_name=$2
  echo "${_slurmd_node_name}" | sed -E "s/${_queue_name_prefix}(.+)-[0-9]+$/\1/"
}

function reload_imex() {
  info "Stopping IMEX"
  timeout ${IMEX_STOP_TIMEOUT} systemctl stop ${IMEX_SERVICE}
  pkill -9 ${IMEX_SERVICE}

  info "Restarting IMEX"
  timeout ${IMEX_START_TIMEOUT} systemctl start ${IMEX_SERVICE}
}

function create_default_imex_channel() {
  info "Creating IMEX default channel"
  MAJOR_NUMBER=$(cat /proc/devices | grep nvidia-caps-imex-channels | cut -d' ' -f1)
  if [ ! -d "/dev/nvidia-caps-imex-channels" ]; then
    sudo mkdir /dev/nvidia-caps-imex-channels
  fi

  # Then check and create device node
  if [ ! -e "/dev/nvidia-caps-imex-channels/channel0" ]; then
    sudo mknod /dev/nvidia-caps-imex-channels/channel0 c $MAJOR_NUMBER 0
    info "IMEX default channel created"
  else
    info "IMEX default channel already exists"
  fi
}

{
  info "PROLOG Start JobId=${SLURM_JOB_ID}: $0"

  return_if_job_is_not_exclusive
  return_if_unsupported_instance_type
  return_if_imex_disabled

  create_default_imex_channel

  IPS_FROM_CR=$(get_ips_from_node_names "${SLURM_NODELIST}")

  info "Node Names: ${SLURM_NODELIST}"
  info "Node IPs: ${IPS_FROM_CR}"
  info "IMEX Nodes Config: ${IMEX_NODES_CONFIG}"

  info "Updating IMEX nodes config ${IMEX_NODES_CONFIG}"
  echo "${IPS_FROM_CR}" > "${IMEX_NODES_CONFIG}"
  reload_imex

  prolog_end

} 2>&1 | tee -a "${LOG_FILE_PATH}" | logger -t "91_nvidia_imex_prolog"
```
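
Once the prolog is deployed (see the next section), you can confirm from the head node that Slurm picked up the prolog settings, and then submit jobs with the `--exclusive` flag as recommended above. This is a minimal sketch; the job script name and node count are placeholders:

```
# On the head node: check the Prolog path and PrologFlags that slurmctld is using
scontrol show config | grep -i prolog

# Submit jobs exclusively so the prolog reconfigures IMEX for the allocated nodes
sbatch --exclusive --nodes=18 my_training_job.sh
```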

## Create the HeadNode OnNodeStart custom action script
<a name="support-nvidia-imex-p6e-gb200-instance-action-script"></a>

Create an `install_custom_action.sh` custom action that downloads the prolog script mentioned above into the shared directory `/opt/slurm/etc/scripts/prolog.d/`, which is accessible by the compute nodes, and sets the proper permissions for execution.

```
#!/bin/bash
set -e

echo "Executing $0"

PROLOG_NVIDIA_IMEX=/opt/slurm/etc/scripts/prolog.d/91_nvidia_imex_prolog.sh
aws s3 cp "s3://<Bucket>/91_nvidia_imex_prolog.sh" "${PROLOG_NVIDIA_IMEX}"
chmod 0755 "${PROLOG_NVIDIA_IMEX}"
```
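
Both scripts must be uploaded to the S3 bucket referenced by the cluster configuration before you create the cluster. The following is a minimal sketch of the upload, using the same placeholder bucket name as the configuration below (replace `<Bucket>` in `install_custom_action.sh` and `<s3-bucket-name>` in the cluster configuration with the same bucket):

```
aws s3 cp 91_nvidia_imex_prolog.sh s3://<s3-bucket-name>/91_nvidia_imex_prolog.sh
aws s3 cp install_custom_action.sh s3://<s3-bucket-name>/install_custom_action.sh
```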

## Create the cluster
<a name="support-nvidia-imex-p6e-gb200-instance-cluster"></a>

Create a cluster that includes P6e-GB200 instances. Below you can find an example configuration with SlurmQueues containing the `u-p6e-gb200x72` UltraServer type.

P6e-GB200 is currently available only in Local Zones. Some [Local Zones do not support NAT gateways](https://docs.aws.amazon.com/local-zones/latest/ug/local-zones-connectivity-nat.html), so follow [Connectivity options for Local Zones](https://docs.aws.amazon.com/local-zones/latest/ug/local-zones-connectivity.html) and [Configure security groups for restricted environments](security-groups-configuration.md) as needed so that ParallelCluster can reach AWS services. Follow [Launch instances with Capacity Blocks (CB)](launch-instances-capacity-blocks.md) because the UltraServer is available only as a Capacity Block.

```
HeadNode:
  CustomActions:
    OnNodeStart:
      Script: s3://<s3-bucket-name>/install_custom_action.sh
    S3Access:
      - BucketName: <s3-bucket-name>
  InstanceType: <HeadNode-instance-type>
  Networking:
    SubnetId: <subnet-abcd78901234567890>
  Ssh:
    KeyName: <Key-name>
Image:
  Os: ubuntu2404
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      - PrologFlags: "Alloc,NoHold"
      - MessageTimeout: 240
  SlurmQueues:
    - CapacityReservationTarget:
        CapacityReservationId: <cr-123456789012345678>
      CapacityType: CAPACITY_BLOCK
      ComputeResources: ### u-p6e-gb200x72
        - DisableSimultaneousMultithreading: true
          Efa:
            Enabled: true
          InstanceType: p6e-gb200.36xlarge  
          MaxCount: 18
          MinCount: 18
          Name: cr1
      Name: q1
      Networking:
        SubnetIds:
          - <subnet-1234567890123456>
```
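
After saving the configuration, for example as `cluster-config.yaml` (the file and cluster names below are placeholders), create the cluster with the ParallelCluster CLI:

```
pcluster create-cluster \
  --cluster-name p6e-gb200-cluster \
  --cluster-configuration cluster-config.yaml
```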

## Validate the IMEX setup
<a name="support-nvidia-imex-p6e-gb200-instance-validate"></a>

The `91_nvidia_imex_prolog.sh` prolog runs when you submit a Slurm job. The following is an example job that checks the status of the nvidia-imex domain.

```
#!/bin/bash
#SBATCH --job-name=nvidia-imex-status-job
#SBATCH --ntasks-per-node=1
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

QUEUE_NAME="q1"
COMPUTE_RES_NAME="cr1"
IMEX_CONFIG_FILE="/opt/parallelcluster/shared/nvidia-imex/config_${QUEUE_NAME}_${COMPUTE_RES_NAME}.cfg"

srun bash -c "/usr/bin/nvidia-imex-ctl -N -c ${IMEX_CONFIG_FILE} > result_\${SLURM_JOB_ID}_\$(hostname).out 2> result_\${SLURM_JOB_ID}_\$(hostname).err"
```
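
Save the job script, for example as `imex_status.sbatch` (a placeholder name), and submit it exclusively so that the prolog runs on the allocated nodes. Each node writes its result to a `result_<job-id>_<hostname>.out` file in the submission directory:

```
sbatch --exclusive --nodes=18 imex_status.sbatch
```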

Check the output of the job:

```
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggered.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation

5/12/2025 06:08:10.580
Nodes:
Node #0   - 172.31.48.81    - READY                - Version: 570.172
Node #1   - 172.31.48.98    - READY                - Version: 570.172
Node #2   - 172.31.48.221   - READY                - Version: 570.172
Node #3   - 172.31.49.228   - READY                - Version: 570.172
Node #4   - 172.31.50.39    - READY                - Version: 570.172
Node #5   - 172.31.50.44    - READY                - Version: 570.172
Node #6   - 172.31.51.66    - READY                - Version: 570.172
Node #7   - 172.31.51.157   - READY                - Version: 570.172
Node #8   - 172.31.52.239   - READY                - Version: 570.172
Node #9   - 172.31.53.80    - READY                - Version: 570.172
Node #10  - 172.31.54.95    - READY                - Version: 570.172
Node #11  - 172.31.54.183   - READY                - Version: 570.172
Node #12  - 172.31.54.203   - READY                - Version: 570.172
Node #13  - 172.31.54.241   - READY                - Version: 570.172
Node #14  - 172.31.55.59    - READY                - Version: 570.172
Node #15  - 172.31.55.187   - READY                - Version: 570.172
Node #16  - 172.31.55.197   - READY                - Version: 570.172
Node #17  - 172.31.56.47    - READY                - Version: 570.172

 Nodes From\To  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
       0        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       1        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       2        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
       3        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
       4        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       5        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       6        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C
       7        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       8        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C 
       9        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      10        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      11        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      12        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      13        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   
      14        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
      15        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
      16        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  
      17        C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C   C  

Domain State: UP
```