


# Running jobs using `kubectl`
<a name="sagemaker-hyperpod-eks-run-jobs-kubectl"></a>

**Note**  
Training job auto-resume requires Kubeflow Training Operator release `1.7.0`, `1.8.0`, or `1.8.1`.

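Install the Kubeflow Training Operator in your cluster using a Helm chart. For more information, see [Install packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). To verify that the Kubeflow Training Operator control plane is set up correctly, run the following command.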

```
kubectl get pods -n kubeflow
```

The output should look similar to the following.

```
NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
```
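In a script, the same check can be automated. The following is a minimal sketch rather than part of the official procedure: the function name `operator_ready` is hypothetical, and it assumes that the operator pod names begin with `training-operator`, as in the sample output above. It reads the `kubectl get pods` table on standard input and succeeds only if at least one such pod exists and all of them report `Running`.

```shell
# Hypothetical helper: succeeds if every training-operator pod is Running.
# Reads the table printed by `kubectl get pods -n kubeflow` on stdin.
# Assumption: operator pod names start with "training-operator".
operator_ready() {
  awk '
    $1 ~ /^training-operator/ {
      seen++                        # count operator pods
      if ($3 != "Running") bad++    # column 3 is STATUS
    }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}

# Usage (requires cluster access):
#   kubectl get pods -n kubeflow | operator_ready && echo "operator is running"
```

The function only parses text, so it can be tried against saved output before being pointed at a live cluster.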

**Submit a training job**

To run a training job, prepare the job configuration file and run the [`kubectl apply`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply) command as follows.

```
kubectl apply -f /path/to/training_job.yaml
```
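Before submitting, it can help to catch manifest problems early. The sketch below wraps the command above in a hypothetical function, `submit_job`, that first runs `kubectl apply` with the `--dry-run=client` flag and applies the manifest only if that dry run succeeds. The function name and the error message are illustrative assumptions, not part of SageMaker HyperPod.

```shell
# Hypothetical helper: validate a manifest with a client-side dry run,
# then submit it only if the dry run succeeds.
submit_job() {
  manifest="$1"
  if [ ! -f "$manifest" ]; then
    echo "manifest not found: $manifest" >&2
    return 1
  fi
  # The dry run does not persist anything to the cluster.
  kubectl apply --dry-run=client -f "$manifest" >/dev/null || return 1
  kubectl apply -f "$manifest"
}

# Usage (requires cluster access):
#   submit_job /path/to/training_job.yaml
```

A client-side dry run checks less than the API server does, so a manifest that passes it can still be rejected when it is actually applied.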

**Describe a training job**

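To retrieve the details of a job submitted to the EKS cluster, use the following command. It returns job information such as the submission time, completion time, job status, and configuration details.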

```
kubectl get pytorchjob training-job -n kubeflow -o yaml
```
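When only the job status is of interest, the full YAML can be narrowed with a JSONPath expression. The sketch below assumes that the job is a `PyTorchJob` named `training-job` (consistent with the deletion output shown later on this page) and that its status carries a `conditions` list with `type` and `status` fields. The function name `job_conditions` is hypothetical.

```shell
# Hypothetical helper: print each status condition of a PyTorchJob
# as TYPE=STATUS, one per line.
# $1 = job name, $2 = namespace (defaults to kubeflow)
job_conditions() {
  job="$1"
  namespace="${2:-kubeflow}"
  kubectl get pytorchjob "$job" -n "$namespace" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Usage (requires cluster access):
#   job_conditions training-job
```

The conditions that appear depend on the job's history, so treat the output as a quick summary and use the full YAML for details.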

**Stop a training job and delete EKS resources**

To stop a training job, use `kubectl delete`. The following example stops the training job created from the configuration file `training_job.yaml`.

```
kubectl delete -f /path/to/training_job.yaml 
```

This should return the following output.

```
pytorchjob.kubeflow.org "training-job" deleted
```

**Enable job auto-resume**

SageMaker HyperPod supports job auto-resume for Kubernetes jobs by integrating with the Kubeflow Training Operator control plane.

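Ensure that the cluster has enough nodes that have passed the SageMaker HyperPod health check. Such nodes have the node label `sagemaker.amazonaws.com/node-health-status` set to `Schedulable`. We recommend including a node selector in the job YAML file so that the job runs only on nodes with this setting, as follows.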

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```
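Because a node selector matches node labels, the same key and value can be passed to `kubectl get nodes -l` to see how many healthy nodes are available before submitting a job. The following is a minimal sketch. The function names `schedulable_node_count` and `enough_nodes` are hypothetical, and the required node count is whatever your job's replica count needs.

```shell
# Hypothetical helper: count nodes whose health-status label is Schedulable.
schedulable_node_count() {
  kubectl get nodes \
    -l sagemaker.amazonaws.com/node-health-status=Schedulable \
    --no-headers | grep -c . || true   # grep -c prints 0 when nothing matches
}

# Hypothetical helper: succeed only if at least $1 such nodes exist.
enough_nodes() {
  required="$1"
  available=$(schedulable_node_count)
  [ "$available" -ge "$required" ]
}

# Usage (requires cluster access):
#   enough_nodes 10 || echo "not enough schedulable nodes" >&2
```

This counts labeled nodes only. It does not check whether those nodes have free capacity for the job's pods.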

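The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable job auto-resume. You need to add two annotations and set `restartPolicy` to `OnFailure`, as follows.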

```
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
  annotations:
    # Annotations that configure job auto-resume
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    # ...... (remaining replica specs elided)
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
```

**Check the job auto-resume status**

Run the following command to check the status of job auto-resume.

```
kubectl describe pytorchjob -n kubeflow <job-name>
```
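The output of this command includes an `Events` table. In the sample outputs that follow, a restart is visible either as an aggregated pod-creation event such as `(x2 over 9m45s)` or as the same pod being created in two separate `SuccessfulCreatePod` events. The sketch below tallies pod creations per pod from that table. The function name `pod_creation_counts` is hypothetical, and the parsing assumes the event layout shown in those samples.

```shell
# Hypothetical helper: read `kubectl describe pytorchjob` output on stdin
# and print "<pod-name> <times-created>" for each pod, sorted by name.
# A count above 1 suggests that the pod was recreated after a restart.
pod_creation_counts() {
  awk '
    /SuccessfulCreatePod/ {
      n = 1
      # Aggregated events look like "7m58s (x2 over 9m45s)"; use the xN count.
      if (match($0, /\(x[0-9]+ over/)) {
        n = substr($0, RSTART + 2, RLENGTH - 7) + 0
      }
      count[$NF] += n    # the pod name is the last field of the message
    }
    END { for (p in count) print p, count[p] }
  ' | sort
}

# Usage (requires cluster access):
#   kubectl describe pytorchjob -n kubeflow <job-name> | pod_creation_counts
```

Applied to the second sample below, this prints `pt-job-2-master-0 2` and `pt-job-2-worker-0 2`. Kubernetes retains events only for a limited time, so the tally covers recent history and is not a complete restart count.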

Depending on the failure pattern, you might see one of the following two Kubeflow training job restart patterns.

**Pattern 1**:

```
Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
```

**Pattern 2**:

```
Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
```