# Seeing errors in compute node initializations

The following sections provide troubleshooting tips for errors that occur during compute node initialization. They cover bootstrap errors, errors that appear in logs, and what to do if none of the scenarios applies to your situation.

**Topics**
+ [Seeing `Node bootstrap error` in `clustermgtd.log`](compute-node-initialization-bootstrap-error-v3.md)
+ [I configured On-Demand Capacity Reservations (ODCRs) or zonal Reserved Instances](compute-node-initialization-odcr-v3.md)
+ [Seeing `An error occurred (VcpuLimitExceeded)` in `slurm_resume.log` when a job fails to run, or in `clustermgtd.log` when cluster creation fails](compute-node-initialization-vpc-limit-v3.md)
+ [Seeing `An error occurred (InsufficientInstanceCapacity)` in `slurm_resume.log` when a job fails to run, or in `clustermgtd.log` when cluster creation fails](compute-node-initialization-ice-failure-v3.md)
+ [Seeing nodes in `DOWN` state with `Reason (Code:InsufficientInstanceCapacity)...`](compute-node-initialization-down-nodes-v3.md)
+ [Seeing `cannot change locale (en_US.utf-8) because it has an invalid name` in `slurm_resume.log`](compute-node-initialization-locale-v3.md)
+ [None of the above applies to my case](compute-node-initialization-not-found-v3.md)

# Seeing `Node bootstrap error` in `clustermgtd.log`

This issue means that compute nodes are failing to bootstrap. For information on how to debug a cluster protected mode issue, see [How to debug protected mode](slurm-protected-mode-v3.md#slurm-protected-mode-debug-v3).

# I configured On-Demand Capacity Reservations (ODCRs) or zonal Reserved Instances

## ODCRs that include instances with multiple network interfaces, such as P4d, P4de, and AWS Trainium (Trn)

In the cluster configuration file, check that the `HeadNode` is in a public subnet and that the compute nodes are in a private subnet.

## ODCRs are targeted ODCRs

### Seeing `Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'.` even though `/opt/slurm/etc/pcluster/run_instances_overrides.json` is already in place as described in [Launch instances with On-Demand Capacity Reservations (ODCR)](launch-instances-odcr-v3.md)

If you are using AWS ParallelCluster versions 3.1.1 to 3.2.1 with targeted ODCRs, and you are also using the [run instances override JSON file](launch-instances-odcr-v3.md), the JSON file might not be formatted correctly. You might see an error in `clustermgtd.log`, such as the following:

```
Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. Using default: {} in /var/log/parallelcluster/clustermgtd.
```

Verify that the JSON file is well formed by running the following command:

```
$ jq . /opt/slurm/etc/pcluster/run_instances_overrides.json
```

### Seeing `Found RunInstances parameters override.` in `clustermgtd.log` when cluster creation fails, or in `slurm_resume.log` when a job fails to run

If you are using the [run instances override JSON file](launch-instances-odcr-v3.md), check that the queue name and the compute resource name are specified correctly in the `/opt/slurm/etc/pcluster/run_instances_overrides.json` file.

### Seeing `An error occurred (InsufficientInstanceCapacity)` in `slurm_resume.log` when a job fails to run, or in `clustermgtd.log` when cluster creation fails

#### Using PG-ODCR (placement group ODCR)

When you create an ODCR with an associated placement group, you must use the same placement group name in the configuration file. Set the corresponding [placement group name](Scheduling-v3.md#yaml-Scheduling-SlurmQueues-Networking-PlacementGroup) in the cluster configuration.

#### Using zonal Reserved Instances

If you are using zonal Reserved Instances with `PlacementGroup`/`Enabled` set to `true` in the cluster configuration, you might see an error, such as the following:

```
We currently do not have sufficient trn1.32xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get trn1.32xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1e, us-east-1f.
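```

You might see this error because the zonal Reserved Instances aren't placed in the same UC (or spine), which can lead to insufficient capacity errors (ICEs) when placement groups are used. To check whether this is the cause, disable the `PlacementGroup` setting in the cluster configuration and see whether the cluster can then allocate the instances.

# Seeing `An error occurred (VcpuLimitExceeded)` in `slurm_resume.log` when a job fails to run, or in `clustermgtd.log` when cluster creation fails

Check the vCPU limits in your account for the specific Amazon EC2 instance type that you are using. If the limit is zero vCPUs, or fewer vCPUs than you are requesting, request a limit increase. For information on how to view current limits and request new limits, see [Amazon EC2 service quotas](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html) in the *Amazon EC2 User Guide*.

# Seeing `An error occurred (InsufficientInstanceCapacity)` in `slurm_resume.log` when a job fails to run, or in `clustermgtd.log` when cluster creation fails

You are experiencing an insufficient capacity issue. Follow [https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/) to troubleshoot the problem.

# Seeing nodes in `DOWN` state with `Reason (Code:InsufficientInstanceCapacity)...`

You are experiencing an insufficient capacity issue. Follow [https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/](https://aws.amazon.com/premiumsupport/knowledge-center/ec2-insufficient-capacity-errors/) to troubleshoot the problem. For more information about the fast insufficient capacity fail-over mode in AWS ParallelCluster, see [Slurm cluster fast insufficient capacity fail-over](slurm-short-capacity-fail-mode-v3.md).

# Seeing `cannot change locale (en_US.utf-8) because it has an invalid name` in `slurm_resume.log`

This can occur if a `yum` install process failed and left the locale settings in an inconsistent state. For example, it can happen when a user terminates the install process.

**To verify the cause, take the following actions:**
+ Run `su - pcluster-admin`. The shell shows an error, such as `cannot change locale...no such file or directory`.
+ Run `localedef --list`. The command returns an empty list, or a list that doesn't contain the default locale.
+ Check the last `yum` command with `yum history` and `yum history info #ID`. Does the last ID include `Return-Code: Success`? If it doesn't, the post-install scripts might not have run successfully.

To fix the issue, try rebuilding the locale with `yum reinstall glibc-all-langpacks`. After the rebuild, if the issue is fixed, `su - pcluster-admin` no longer shows an error or warning.

# None of the above applies to my case

To troubleshoot compute node initialization issues, see [Troubleshooting node initialization issues](troubleshooting-v3-scaling-issues.md#troubleshooting-v3-node-init).

Check whether your scenario is covered in [GitHub Known Issues](https://github.com/aws/aws-parallelcluster/wiki) for AWS ParallelCluster on GitHub.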
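
As a supplement to the `run_instances_overrides.json` checks described in the ODCR sections, the following is a minimal Python sketch that reports both malformed JSON and queue or compute resource names that don't match your cluster configuration. It assumes the override file is a JSON object keyed by queue name, with each queue mapping to an object keyed by compute resource name; confirm that layout against the ODCR launch instructions before relying on it. The function name `check_overrides` and the names `queue1` and `compute-resource1` are hypothetical and aren't part of AWS ParallelCluster.

```python
import json


def check_overrides(text, expected):
    """Return a list of problems found in a run-instances overrides document.

    text: raw contents of the overrides file.
    expected: dict mapping each queue name to the set of compute resource
        names defined for it in your cluster configuration (you supply this).

    Assumed layout (not confirmed by this page):
        {"<queue name>": {"<compute resource name>": {...parameters...}}}
    """
    try:
        overrides = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err.msg} (line {err.lineno}, column {err.colno})"]

    if not isinstance(overrides, dict):
        return ["top level must be a JSON object keyed by queue name"]

    problems = []
    for queue, resources in overrides.items():
        if queue not in expected:
            problems.append(f"unknown queue name: {queue}")
            continue
        if not isinstance(resources, dict):
            problems.append(f"queue {queue} must map to a JSON object")
            continue
        for resource in resources:
            if resource not in expected[queue]:
                problems.append(
                    f"unknown compute resource in queue {queue}: {resource}"
                )
    return problems


# Hypothetical names; replace them with the ones in your cluster configuration.
sample = '{"queue1": {"compute-resource1": {}}}'
print(check_overrides(sample, {"queue1": {"compute-resource1"}}))  # prints []
```

To check the real file, read `/opt/slurm/etc/pcluster/run_instances_overrides.json` into a string and pass it as `text`. An empty list means the file parsed and every name matched the `expected` mapping you supplied.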