
# Troubleshooting HyperPod inference
<a name="sagemaker-hyperpod-model-deployment-ts"></a>

This troubleshooting guide covers common issues that can occur during Amazon SageMaker HyperPod inference deployment and operation. These issues typically involve VPC networking configuration, IAM permissions, Kubernetes resource management, and operator connectivity problems that can prevent a model deployment from succeeding or leave it in a failed or pending state.

This guide uses the following terminology: **Troubleshooting steps** are diagnostic procedures for identifying and investigating a problem, **Resolution** provides the specific actions that fix an identified problem, and **Verification** confirms that the fix worked.

**Topics**
+ [Inference operator installation failures through the SageMaker AI console](sagemaker-hyperpod-model-deployment-ts-console-cfn-failures.md)
+ [Inference operator installation failures through the AWS CLI](sagemaker-hyperpod-model-deployment-ts-cli.md)
+ [Certificate download timeout](sagemaker-hyperpod-model-deployment-ts-certificate.md)
+ [Model deployment issues](sagemaker-hyperpod-model-deployment-ts-deployment-issues.md)
+ [VPC ENI permission issue](sagemaker-hyperpod-model-deployment-ts-permissions.md)
+ [IAM trust relationship issue](sagemaker-hyperpod-model-deployment-ts-trust.md)
+ [Missing NVIDIA GPU plugin error](sagemaker-hyperpod-model-deployment-ts-gpu.md)
+ [Inference operator fails to start](sagemaker-hyperpod-model-deployment-ts-startup.md)

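# Inference operator installation failures through the SageMaker AI console
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-failures"></a>

**Overview:** When you install the inference operator through the SageMaker AI console using Quick Install or Custom Install, the underlying CloudFormation stack can fail for a variety of reasons. This section covers common failure scenarios and their resolutions.

## Inference operator add-on installation fails with Quick Install or Custom Install
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-stack-failed"></a>

**Problem:** HyperPod cluster creation completes successfully, but the inference operator add-on installation fails.

**Common causes:**
+ The pod capacity limit on the cluster nodes is exceeded. The inference operator installation requires at least 13 pods. The minimum recommended instance type is `ml.c5.4xlarge`.
+ IAM permission issues
+ Resource quota constraints
+ Network or VPC configuration issues

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-symptoms"></a>

**Symptoms:**
+ The inference operator add-on shows a CREATE_FAILED or DEGRADED status in the console.
+ The CloudFormation stack associated with the add-on is in the CREATE_FAILED state.
+ The installation stops progressing or displays an error message.

**Diagnostic steps:**

1. Check the inference operator add-on status: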

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

1. Check for pod limit issues:

   ```
   # Check current pod count per node
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable.pods, capacity: .status.capacity.pods}'
   
   # Check pods running on each node (custom-columns avoids miscounting when RESTARTS contains spaces)
   kubectl get pods --all-namespaces -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c
   
   # Check for pod evictions or failures
   kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "pod\|limit\|quota"
   ```

1. Check the CloudFormation stack status (if you installed through the console):

   ```
   # List CloudFormation stacks related to the cluster
   aws cloudformation list-stacks \
       --region $REGION \
       --query "StackSummaries[?contains(StackName, '$EKS_CLUSTER_NAME') && StackStatus=='CREATE_FAILED'].{Name:StackName,Status:StackStatus,Reason:StackStatusReason}" \
       --output table
   
   # Get detailed stack events
   aws cloudformation describe-stack-events \
       --stack-name <stack-name> \
       --region $REGION \
       --query "StackEvents[?ResourceStatus=='CREATE_FAILED']" \
       --output table
   ```

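### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-resolution"></a>

To resolve an installation failure, save the current configuration, delete the failed add-on, fix the underlying problem, and then reinstall the inference operator through the SageMaker AI console (recommended) or the AWS CLI.

**Step 1: Save the current configuration**
+ Extract and save the add-on configuration before you delete the add-on: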

  ```
  # Save the current configuration
  aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query 'addon.configurationValues' \
      --output text > addon-config-backup.json
  
  # Verify the configuration was saved
  cat addon-config-backup.json
  
  # Pretty print for readability
  cat addon-config-backup.json | jq '.'
  ```
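Because the next step deletes the add-on, it can help to confirm that the backup file actually holds a configuration before you continue. The following is a minimal sketch under stated assumptions: `backup_is_usable` is a hypothetical helper name, and the check is only a heuristic (the file is non-empty and its content looks like a JSON object), not full JSON validation. When `configurationValues` is empty, the AWS CLI text output is the literal string `None`, which this check rejects.

```shell
# Hypothetical helper: returns 0 if the backup file looks like a JSON object.
# Heuristic only: it checks that the content starts with "{" and ends with "}".
backup_is_usable() {
  file=$1
  # -s is true only when the file exists and is not empty
  [ -s "$file" ] || return 1
  # Strip whitespace so leading or trailing newlines do not affect the match
  content=$(tr -d '[:space:]' < "$file")
  case "$content" in
    '{'*'}') return 0 ;;
    *) return 1 ;;   # includes the literal "None" printed for an empty value
  esac
}
```

For example, `backup_is_usable addon-config-backup.json || echo "Backup is empty or invalid"` prints a warning instead of letting you delete the add-on with an unusable backup.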

**Step 2: Delete the failed add-on**
+ Delete the inference operator add-on:

  ```
  aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION
  
  # Wait for deletion to complete
  echo "Waiting for add-on deletion..."
  aws eks wait addon-deleted \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || sleep 60
  ```

**Step 3: Fix the underlying problem**

Choose the appropriate resolution based on the cause of the failure.

If the pod limit was exceeded:

```
# The inference operator requires a minimum of 13 pods.
# The minimum recommended instance type is ml.c5.4xlarge.
#
# Option 1: Add instance group with higher pod capacity
# Different instance types support different maximum pod counts
# For example: m5.large (29 pods), m5.xlarge (58 pods), m5.2xlarge (58 pods)
aws sagemaker update-cluster \
    --cluster-name $HYPERPOD_CLUSTER_NAME \
    --region $REGION \
    --instance-groups '[{"InstanceGroupName":"worker-group-2","InstanceType":"ml.m5.xlarge","InstanceCount":2}]'

# Option 2: Scale existing node group to add more nodes
aws eks update-nodegroup-config \
    --cluster-name $EKS_CLUSTER_NAME \
    --nodegroup-name <nodegroup-name> \
    --scaling-config minSize=2,maxSize=10,desiredSize=5 \
    --region $REGION

# Option 3: Clean up unused pods
kubectl delete pods --field-selector status.phase=Failed --all-namespaces
kubectl delete pods --field-selector status.phase=Succeeded --all-namespaces
```

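**Step 4: Reinstall the inference operator**

After you fix the underlying problem, reinstall the inference operator using one of the following methods:
+ **SageMaker AI console with Custom Install (recommended):** Reuses the existing IAM roles and TLS bucket from the previous installation. For steps, see [Method 1: Install the HyperPod inference add-on through the SageMaker AI console (recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).
+ **AWS CLI with the saved configuration:** Reinstalls the add-on using the configuration that you backed up in step 1. For the full CLI installation steps, see [Method 2: Install the inference operator using the AWS CLI](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-addon).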

  ```
  aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config-backup.json \
      --region $REGION
  ```
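+ **SageMaker AI console with Quick Install:** Automatically creates new IAM roles, a TLS bucket, and the dependency add-ons. For steps, see [Method 1: Install the HyperPod inference add-on through the SageMaker AI console (recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).

**Step 5: Verify that the installation succeeded**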

```
# Check add-on status
aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query "addon.{Status:status,Health:health}" \
    --output table

# Verify pods are running
kubectl get pods -n hyperpod-inference-system

# Check operator logs
kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
```

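## Cert-manager installation fails because the Kueue webhook is not ready
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-webhook-race"></a>

**Problem:** The cert-manager add-on installation fails with a webhook error because the task governance (Kueue) webhook service has no available endpoints. This is a race condition that occurs when cert-manager tries to create resources before the task governance webhook pods are fully running. It can happen when the task governance add-on is installed together with the inference operator during cluster creation.

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-symptoms"></a>

**Error message:**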

```
AdmissionRequestDenied
Internal error occurred: failed calling webhook "mdeployment.kb.io": failed to call webhook: 
Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-apps-v1-deployment?timeout=10s": 
no endpoints available for service "kueue-webhook-service"
```

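**Root cause:**
+ The task governance add-on installs and registers a mutating webhook that intercepts all Deployment creation requests.
+ The cert-manager add-on tries to create Deployment resources before the task governance webhook pods are ready.
+ Kubernetes admission control calls the task governance webhook, but the service has no endpoints because the pods are not running yet.

**Diagnostic steps:**

1. Check the cert-manager add-on status: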

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-resolution"></a>

**Solution: Delete and reinstall cert-manager**

The task governance webhook becomes ready within 60 seconds. Deleting and reinstalling the cert-manager add-on is enough to resolve the failure.

1. Delete the failed cert-manager add-on:

   ```
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

1. Wait 30-60 seconds for the task governance webhook to become ready, and then reinstall the cert-manager add-on:

   ```
   sleep 60
   
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```
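Instead of a fixed delay, you could poll until the webhook service reports at least one endpoint. The following is a minimal sketch, not part of the official procedure. The service name `kueue-webhook-service` and the namespace `kueue-system` are taken from the error message shown earlier; `kueue_webhook_ready` and `wait_until` are hypothetical helper names introduced here for illustration.

```shell
# Succeeds when the Kueue webhook service has at least one ready endpoint address.
# Service name and namespace come from the error message in this section.
kueue_webhook_ready() {
  ips=$(kubectl get endpoints kueue-webhook-service -n kueue-system \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null) || ips=""
  [ -n "$ips" ]
}

# Hypothetical helper: retry a command until it succeeds or the timeout expires.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}
```

For example, `wait_until 120 5 kueue_webhook_ready` returns as soon as an endpoint is available, or fails after roughly two minutes, at which point you can inspect the webhook pods before retrying the cert-manager installation.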

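# Inference operator installation failures through the AWS CLI
<a name="sagemaker-hyperpod-model-deployment-ts-cli"></a>

**Overview:** When you install the inference operator through the AWS CLI, the add-on installation can fail because of missing dependencies. This section covers common CLI installation failure scenarios and their resolutions.

## Inference add-on installation fails because of missing CSI drivers
<a name="sagemaker-hyperpod-model-deployment-ts-missing-csi-drivers"></a>

**Problem:** Inference operator add-on creation fails because the required CSI driver dependencies are not installed in the EKS cluster.

**Symptoms and diagnosis:**

**Error message:**

The following errors appear in the add-on creation logs or the inference operator logs: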

```
S3 CSI driver not installed (missing CSIDriver s3.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.

FSx CSI driver not installed (missing CSIDriver fsx.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check whether the CSI drivers are installed:

   ```
   # Check for S3 CSI driver
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   
   # Check for FSx CSI driver  
   kubectl get csidriver fsx.csi.aws.com
   kubectl get pods -n kube-system | grep fsx
   ```

1. Check the EKS add-on status:

   ```
   # List all add-ons
   aws eks list-addons --cluster-name $EKS_CLUSTER_NAME --region $REGION
   
   # Check specific CSI driver add-ons
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION 2>/dev/null || echo "S3 CSI driver not installed"
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION 2>/dev/null || echo "FSx CSI driver not installed"
   ```

1. Check the inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

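**Resolution:**

**Step 1: Install the missing S3 CSI driver**

1. Make sure that an IAM role for the S3 CSI driver exists (create one if you have not already), and then look up its ARN: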

   ```
   # Set up service account role ARN (from installation steps)
   export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "S3 CSI Role ARN: $S3_CSI_ROLE_ARN"
   ```

1. Install the S3 CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --addon-version v1.14.1-eksbuild.1 \
       --service-account-role-arn $S3_CSI_ROLE_ARN \
       --region $REGION
   ```

1. Verify the S3 CSI driver installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
   
   # Verify CSI driver is available
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   ```

**Step 2: Install the missing FSx CSI driver**

1. Make sure that an IAM role for the FSx CSI driver exists (create one if you have not already), and then look up its ARN:

   ```
   # Set up service account role ARN (from installation steps)
   export FSX_CSI_ROLE_ARN=$(aws iam get-role --role-name $FSX_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "FSx CSI Role ARN: $FSX_CSI_ROLE_ARN"
   ```

1. Install the FSx CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --addon-version v1.6.0-eksbuild.1 \
       --service-account-role-arn $FSX_CSI_ROLE_ARN \
       --region $REGION
   
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
   
   # Verify FSx CSI driver is running
   kubectl get pods -n kube-system | grep fsx
   ```
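The role lookups in steps 1 and 2 fall back to the literal string `Role not found`, which would then be passed to `--service-account-role-arn`. The following sketch shows one way to stop early in that case. `is_role_arn` and `require_role_arn` are hypothetical helper names, and the pattern is a loose shape check rather than a complete ARN validator.

```shell
# Hypothetical helper: returns 0 when the argument looks like an IAM role ARN.
# Loose shape check: arn:<partition>:iam::<account-id>:role/<name>
is_role_arn() {
  case "$1" in
    arn:aws*:iam::[0-9]*:role/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print an error and fail instead of passing a placeholder string to create-addon.
# Usage: require_role_arn <variable-name> <value>
require_role_arn() {
  name=$1
  value=$2
  if ! is_role_arn "$value"; then
    echo "ERROR: $name is not a valid IAM role ARN: '$value'" >&2
    return 1
  fi
}
```

For example, run `require_role_arn S3_CSI_ROLE_ARN "$S3_CSI_ROLE_ARN"` before the `aws eks create-addon` command and continue only if it succeeds.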

**Step 3: Verify all dependencies**

After you install the missing dependencies, verify that they are running correctly before you retry the inference operator installation.

```
# Check all required add-ons are active
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify all pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```
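The `describe-addon` calls above print full JSON for each add-on. If you only need a pass or fail summary, a loop such as the following sketch may be easier to read. `check_addons` is a hypothetical helper name, `NOT_INSTALLED` is a label used only by this sketch (it is not an EKS status value), and `EKS_CLUSTER_NAME` and `REGION` are the same variables used throughout this guide.

```shell
# Hypothetical helper: print the status of each add-on and fail if any is not ACTIVE.
# Usage: check_addons <addon-name> [<addon-name> ...]
check_addons() {
  rc=0
  for name in "$@"; do
    # describe-addon fails when the add-on does not exist on the cluster
    status=$(aws eks describe-addon \
        --cluster-name "$EKS_CLUSTER_NAME" \
        --addon-name "$name" \
        --region "$REGION" \
        --query 'addon.status' \
        --output text 2>/dev/null) || status="NOT_INSTALLED"
    printf '%-32s %s\n' "$name" "$status"
    [ "$status" = "ACTIVE" ] || rc=1
  done
  return "$rc"
}
```

For example, `check_addons aws-mountpoint-s3-csi-driver aws-fsx-csi-driver metrics-server cert-manager` prints one line per add-on and returns a non-zero exit code if any dependency is missing or not yet active.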

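## Inference custom resource definitions are missing during model deployment
<a name="sagemaker-hyperpod-model-deployment-ts-crd-not-exist"></a>

**Problem:** Custom resource definitions (CRDs) are missing when you try to create a model deployment. This problem occurs when you previously installed and then deleted the inference add-on without first cleaning up model deployments that have finalizers.

**Symptoms and diagnosis:**

**Root cause:**

If you delete the inference add-on without first removing all model deployments, custom resources that have finalizers remain in the cluster. These finalizers must complete before the CRDs can be deleted. The add-on deletion process does not wait for CRD deletion to finish, so the CRDs remain in a terminating state and block new installations.

**To diagnose this problem**

1. Check whether the CRDs exist: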

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

1. Check for stuck custom resources:

   ```
   # Check for JumpStartModel resources
   kubectl get jumpstartmodels -A
   
   # Check for InferenceEndpointConfig resources
   kubectl get inferenceendpointconfigs -A
   ```

1. Inspect the finalizers on the stuck resources:

   ```
   # Example for a specific JumpStartModel
   kubectl get jumpstartmodels <model-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   
   # Example for a specific InferenceEndpointConfig
   kubectl get inferenceendpointconfigs <config-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   ```

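**Resolution:**

Manually remove the finalizers from every model deployment that was not deleted when the inference add-on was removed. Complete the following steps for each stuck custom resource.

**To remove finalizers from JumpStartModel resources**

1. List all JumpStartModel resources in all namespaces: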

   ```
   kubectl get jumpstartmodels -A
   ```

1. For each JumpStartModel resource, remove the finalizers by patching the resource to set `metadata.finalizers` to an empty array:

   ```
   kubectl patch jumpstartmodels <model-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example patches a resource named kv-l1-only:

   ```
   kubectl patch jumpstartmodels kv-l1-only -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the model instances are deleted:

   ```
   kubectl get jumpstartmodels -A
   ```

   When all resources are cleaned up, you see the following output:

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=jumpstartmodels": the server could not find the requested resource (get jumpstartmodels.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the JumpStartModel CRD is removed:

   ```
   kubectl get crd | grep jumpstartmodels.inference.sagemaker.aws.amazon.com
   ```

   If the CRD was removed successfully, this command returns no output.

**To remove finalizers from InferenceEndpointConfig resources**

1. List all InferenceEndpointConfig resources in all namespaces:

   ```
   kubectl get inferenceendpointconfigs -A
   ```

1. For each InferenceEndpointConfig resource, remove the finalizers:

   ```
   kubectl patch inferenceendpointconfigs <config-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example patches a resource named my-inference-config:

   ```
   kubectl patch inferenceendpointconfigs my-inference-config -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the config instances are deleted:

   ```
   kubectl get inferenceendpointconfigs -A
   ```

   When all resources are cleaned up, you see the following output:

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=inferenceendpointconfigs": the server could not find the requested resource (get inferenceendpointconfigs.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the InferenceEndpointConfig CRD is removed:

   ```
   kubectl get crd | grep inferenceendpointconfigs.inference.sagemaker.aws.amazon.com
   ```

   If the CRD was removed successfully, this command returns no output.
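When many resources are stuck, patching them one at a time is tedious. The two procedures above can be sketched as a loop. `clear_finalizers` is a hypothetical helper name, not part of kubectl or the inference operator. Note that removing finalizers skips whatever cleanup the operator would normally perform, so related resources may be left behind; treat this as a last resort for resources that are already stuck.

```shell
# Hypothetical helper: remove finalizers from every resource of one kind.
# Usage: clear_finalizers jumpstartmodels
clear_finalizers() {
  kind=$1
  # Print "<namespace> <name>" for every resource of this kind
  kubectl get "$kind" -A --no-headers \
      -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name 2>/dev/null |
  while read -r ns name; do
    # Skip blank lines, for example when no resources exist
    [ -n "$name" ] || continue
    kubectl patch "$kind" "$name" -n "$ns" \
        -p '{"metadata":{"finalizers":[]}}' --type=merge < /dev/null
  done
}
```

For example, `clear_finalizers jumpstartmodels` followed by `clear_finalizers inferenceendpointconfigs` applies the same patch shown in the steps above to every remaining resource.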

**To reinstall the inference add-on**

After you clean up all stuck resources and confirm that the CRDs are removed, reinstall the inference add-on. For more information, see [Install the inference operator using the EKS add-on](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon).

**Verification:**

1. Verify that the inference add-on installed successfully:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   ```

   The status should be ACTIVE and the health should be HEALTHY.

1. Verify that the CRDs are installed correctly:

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

   The output lists the inference-related CRDs.

1. Test creating a new model deployment to confirm that the problem is resolved:

   ```
   # Create a test deployment using your preferred method
   kubectl apply -f <your-model-deployment.yaml>
   ```

**Prevention:**

To avoid this problem, complete the following steps before you remove the inference add-on:

1. Delete all model deployments:

   ```
   # Delete all JumpStartModel resources
   kubectl delete jumpstartmodels --all -A
   
   # Delete all InferenceEndpointConfig resources
   kubectl delete inferenceendpointconfigs --all -A
   
   # Wait for all resources to be fully deleted
   kubectl get jumpstartmodels -A
   kubectl get inferenceendpointconfigs -A
   ```

1. Verify that all custom resources are deleted.

1. After you confirm that all resources are cleaned up, delete the inference add-on.

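## Inference add-on installation fails because cert-manager is missing
<a name="sagemaker-hyperpod-model-deployment-ts-missing-cert-manager"></a>

**Problem:** Inference operator add-on creation fails because the cert-manager EKS add-on is not installed, which results in missing custom resource definitions (CRDs).

**Symptoms and diagnosis:**

**Error message:**

The following error appears in the add-on creation logs or the inference operator logs: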

```
Missing required CRD: certificaterequests.cert-manager.io. 
The cert-manager add-on is not installed. Please install cert-manager and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check whether cert-manager is installed:

   ```
   # Check for cert-manager CRDs
   kubectl get crd | grep cert-manager
   kubectl get pods -n cert-manager
   
   # Check EKS add-on status
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION 2>/dev/null || echo "Cert-manager not installed"
   ```

1. Check the inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install the cert-manager add-on**

1. Install the cert-manager EKS add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --addon-version v1.18.2-eksbuild.2 \
       --region $REGION
   ```

1. Verify the cert-manager installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION
   
   # Verify cert-manager pods are running
   kubectl get pods -n cert-manager
   
   # Verify CRDs are installed
   kubectl get crd | grep cert-manager | wc -l
   # Expected: Should show multiple cert-manager CRDs
   ```

**Step 2: Retry the inference operator installation**

1. After cert-manager is installed, retry the inference operator installation:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
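   # Wait for deletion to complete (falls back to a fixed delay if the waiter fails)
   aws eks wait addon-deleted \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || sleep 30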
   
   # Reinstall the inference operator add-on
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation:

   ```
   # Check installation status
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify inference operator pods are running
   kubectl get pods -n hyperpod-inference-system
   ```

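## Inference add-on installation fails because the ALB Controller is missing
<a name="sagemaker-hyperpod-model-deployment-ts-missing-alb"></a>

**Problem:** Inference operator add-on creation fails because the AWS Load Balancer Controller is not installed or is not configured correctly for the inference add-on.

**Symptoms and diagnosis:**

**Error message:**

The following error appears in the add-on creation logs or the inference operator logs: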

```
ALB Controller not installed (missing aws-load-balancer-controller pods). 
Please install the Application Load Balancer Controller and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check whether the ALB Controller is installed:

   ```
   # Check for ALB Controller pods
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   
   # Check ALB Controller service account
   kubectl get serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null || echo "ALB Controller service account not found"
   kubectl get serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null || echo "ALB Controller service account not found in inference namespace"
   ```

1. Check the inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

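**Resolution:**

Choose one of the following options based on your setup.

**Option 1: Let the inference add-on install the ALB Controller (recommended)**
+ Make sure that the ALB role has been created and is configured correctly in the add-on configuration: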

  ```
  # Verify ALB role exists
  export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "ALB Role ARN: $ALB_ROLE_ARN"
  
  # Update your addon-config.json to enable ALB
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": true,
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use an existing ALB Controller installation**
+ If the ALB Controller is already installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable ALB installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": false
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Retry the inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
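   # Wait for deletion to complete (falls back to a fixed delay if the waiter fails)
   aws eks wait addon-deleted \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || sleep 30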
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify that the ALB Controller is working:

   ```
   # Check ALB Controller pods
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   
   # Check service account annotations
   kubectl describe serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null
   ```

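## Inference add-on installation fails because the KEDA operator is missing
<a name="sagemaker-hyperpod-model-deployment-ts-missing-keda"></a>

**Problem:** Inference operator add-on creation fails because the KEDA (Kubernetes Event Driven Autoscaler) operator is not installed or is not configured correctly for the inference add-on.

**Symptoms and diagnosis:**

**Error message:**

The following error appears in the add-on creation logs or the inference operator logs: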

```
KEDA operator not installed (missing keda-operator pods). 
KEDA can be installed separately in any namespace or via the Inference addon.
```

**Diagnostic steps:**

1. Check whether the KEDA operator is installed:

   ```
   # Check for KEDA operator pods in common namespaces
   kubectl get pods -n keda-system 2>/dev/null | grep keda-operator || echo "KEDA not found in keda-system namespace"
   kubectl get pods -n kube-system 2>/dev/null | grep keda-operator || echo "KEDA not found in kube-system namespace"
   kubectl get pods -n hyperpod-inference-system 2>/dev/null | grep keda-operator || echo "KEDA not found in inference namespace"
   
   # Check for KEDA CRDs
   kubectl get crd 2>/dev/null | grep keda || echo "KEDA CRDs not found"
   
   # Check KEDA service account (a named resource cannot be fetched with -A, so filter the full list)
   kubectl get serviceaccount -A 2>/dev/null | grep keda-operator || echo "KEDA service account not found"
   ```

1. Check the inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

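**Resolution:**

Choose one of the following options based on your setup.

**Option 1: Let the inference add-on install KEDA (recommended)**
+ Make sure that the KEDA role has been created and is configured correctly in the add-on configuration: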

  ```
  # Verify KEDA role exists
  export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "KEDA Role ARN: $KEDA_ROLE_ARN"
  
  # Update your addon-config.json to enable KEDA
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": true,
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use an existing KEDA installation**
+ If KEDA is already installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable KEDA installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": false
    }
  }
  EOF
  ```

**Retry the inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
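   # Wait for deletion to complete (falls back to a fixed delay if the waiter fails)
   aws eks wait addon-deleted \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || sleep 30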
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify that KEDA is working:

   ```
   # Check KEDA pods
   kubectl get pods -n hyperpod-inference-system | grep keda
   kubectl get pods -n kube-system | grep keda
   kubectl get pods -n keda-system 2>/dev/null | grep keda
   
   # Check KEDA CRDs
   kubectl get crd | grep scaledobjects
   kubectl get crd | grep scaledjobs
   
   # Check KEDA service account annotations
   kubectl describe serviceaccount keda-operator -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n kube-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n keda-system 2>/dev/null
   ```

# Certificate download timeout
<a name="sagemaker-hyperpod-model-deployment-ts-certificate"></a>

When you deploy a SageMaker AI endpoint, the creation process fails because the certificate authority (CA) certificate cannot be downloaded in the VPC environment. For detailed configuration steps, see the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

**Error message:**

The following error appears in the SageMaker AI endpoint CloudWatch logs:

```
Error downloading CA certificate: Connect timeout on endpoint URL: "https://****.s3.<REGION>.amazonaws.com/****/***.pem"
```

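**Root cause:**
+ This problem occurs when the inference operator cannot access the self-signed certificate in Amazon S3 from within the VPC.
+ A correctly configured Amazon S3 VPC endpoint is required for certificate access.

**Resolution:**

1. If you do not have an Amazon S3 VPC endpoint:
   + Create an Amazon S3 VPC endpoint by following the configuration in section 5.3 of the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

1. If you already have an Amazon S3 VPC endpoint:
   + Make sure that the subnet route table is configured to point to the VPC endpoint (if you use a gateway endpoint) or that private DNS is enabled for the interface endpoint.
   + The Amazon S3 VPC endpoint should be similar to the configuration described in the endpoint creation step of section 5.3.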

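# Model deployment issues
<a name="sagemaker-hyperpod-model-deployment-ts-deployment-issues"></a>

**Overview:** This section covers common issues that occur during model deployment, including deployments stuck in the pending state, failed deployments, and monitoring deployment progress.

## Model deployment stuck in the pending state
<a name="sagemaker-hyperpod-model-deployment-ts-pending"></a>

When you deploy a model, the deployment remains in the "Pending" state for an extended period. This indicates that the inference operator cannot start the model deployment on the HyperPod cluster.

**Affected components:**

During a normal deployment, the inference operator should do the following:
+ Deploy the model pod
+ Create the load balancer
+ Create the SageMaker AI endpoint

**Troubleshooting steps:**

1. Check the inference operator pod status: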

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Example of the expected output:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the inference operator logs and look for error messages:

   ```
   kubectl logs <operator-pod-name> -n hyperpod-inference-system
   ```

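**What to look for:**
+ Error messages in the operator logs
+ The status of the operator pod
+ Deployment-related warnings or failures

**Note**  
A healthy deployment should progress past the "Pending" state within a reasonable time. If the problem persists, review the inference operator logs for specific error messages to determine the root cause.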
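Operator logs can be long, so it may help to filter them down to lines that look like problems. The following is a minimal sketch: `filter_operator_errors` is a hypothetical helper name, and the keyword list is an assumption based on the error messages shown in this guide, so adjust it for your environment.

```shell
# Hypothetical helper: keep only log lines that look like errors or failures.
# Reads log text on standard input; the keyword list is an assumption.
filter_operator_errors() {
  # "|| true" keeps the exit status at 0 when no line matches
  grep -iE 'error|fail|denied|timeout|forbidden' || true
}
```

For example, pipe the output of `kubectl logs` for the operator pod (optionally with `--since=1h`) into `filter_operator_errors` to see only the suspicious lines.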

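## Troubleshooting a model deployment in the failed state
<a name="sagemaker-hyperpod-model-deployment-ts-failed"></a>

When a model deployment enters the "Failed" state, the failure can originate in one of the following three components:
+ Model pod deployment
+ Load balancer creation
+ SageMaker AI endpoint creation

**Troubleshooting steps:**

1. Check the inference operator status: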

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the operator logs:

   ```
   kubectl logs <operator-pod-name> -n hyperpod-inference-system
   ```

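**What to look for:**

The operator logs show which component failed:
+ Model pod deployment failures
+ Load balancer creation problems
+ SageMaker AI endpoint errors

## Checking model deployment progress
<a name="sagemaker-hyperpod-model-deployment-ts-progress"></a>

To monitor the progress of a model deployment and identify potential problems, you can use kubectl commands to check the status of the various components. This helps you determine whether the deployment is progressing normally or has run into a problem during model pod creation, load balancer setup, or SageMaker AI endpoint configuration.

**Method 1: Check the JumpStart model status**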

```
kubectl describe jumpstartmodel.inference.sagemaker.aws.amazon.com/<model-name> -n <namespace>
```

**Key status indicators to monitor:**

1. Deployment status
   + Look for `Status.State`: should show `DeploymentComplete`
   + Check `Status.Deployment Status.Available Replicas`
   + Monitor `Status.Conditions` for deployment progress

1. SageMaker AI endpoint status
   + Check `Status.Endpoints.Sagemaker.State`: should show `CreationCompleted`
   + Verify `Status.Endpoints.Sagemaker.Endpoint Arn`

1. TLS certificate status
   + View the `Status.Tls Certificate` details
   + Check the certificate expiry in `Last Cert Expiry Time`

**Method 2: Check the inference endpoint configuration**

```
kubectl describe inferenceendpointconfig.inference.sagemaker.aws.amazon.com/<deployment_name> -n <namespace>
```

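**Common status states:**
+ `DeploymentInProgress`: Initial deployment phase
+ `DeploymentComplete`: Deployment succeeded
+ `Failed`: Deployment failed

**Note**  
Monitor the Events section for warnings or errors. Check that the replica count matches the expected configuration. For a healthy deployment, verify that all conditions show `Status: True`.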
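
Running `kubectl describe` repeatedly works for a spot check, but a small polling loop is easier when you wait on a deployment. The following is a minimal sketch that reads the state and maps it to one of the states listed above. It assumes that the `Status.State` field in the `describe` output corresponds to the JSON path `.status.state` on the custom resource. That mapping is an inference from how `kubectl describe` prints field names and is not confirmed by this guide. The function names, attempt count, and delay are illustrative.

```shell
# Map a Status.State value to a short verdict. States are the ones listed above.
classify_state() {
  case "$1" in
    DeploymentComplete)   echo "complete" ;;
    DeploymentInProgress) echo "in-progress" ;;
    Failed)               echo "failed" ;;
    *)                    echo "unknown" ;;
  esac
}

# Poll the resource until it completes, fails, or the attempts run out.
# Assumption: Status.State maps to the JSON path .status.state.
# Arguments: <resource> <namespace> [attempts] [sleep-seconds]
wait_for_deployment() {
  local resource="$1"
  local namespace="$2"
  local attempts="${3:-30}"
  local delay="${4:-20}"
  local i=1
  local state verdict
  while [ "$i" -le "$attempts" ]; do
    state="$(kubectl get "$resource" -n "$namespace" -o jsonpath='{.status.state}')"
    verdict="$(classify_state "$state")"
    echo "attempt $i: state=${state:-empty} ($verdict)"
    case "$verdict" in
      complete) return 0 ;;
      failed)   return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  return 2
}

# Usage (requires cluster access; names are placeholders):
#   wait_for_deployment inferenceendpointconfig.inference.sagemaker.aws.amazon.com/<deployment_name> <namespace>
```

The function returns 0 on `DeploymentComplete`, 1 on `Failed`, and 2 if the attempts are exhausted, so a calling script can branch on the result. If the state is reported as empty or unknown, check the field name against the `describe` output before you rely on the loop.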

# VPC ENI permission issue
<a name="sagemaker-hyperpod-model-deployment-ts-permissions"></a>

SageMaker AI endpoint creation fails because of insufficient permissions to create network interfaces in the VPC.

**Error message:**

```
Please ensure that the execution role for variant AllTraffic has sufficient permissions for creating an endpoint variant within a VPC
```

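**Root cause:**

The inference operator's execution role lacks the Amazon EC2 permissions required to create network interfaces (ENIs) in the VPC.

**Resolution:**

Add the following IAM permissions to the inference operator's execution role.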

```
{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterfacePermission"
    ],
    "Resource": "*"
}
```
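
The statement above is a fragment. IAM expects a complete policy document with `Version` and `Statement` elements. The following is a minimal sketch of one way to wrap the statement and attach it as an inline policy using `aws iam put-role-policy`. The `ROLE_NAME` variable and the policy name `HyperPodInferenceEniAccess` are hypothetical placeholders and do not come from this guide.

```shell
# Print a complete IAM policy document that wraps the statement shown above.
eni_policy_document() {
  cat <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterfacePermission",
                "ec2:DeleteNetworkInterfacePermission"
            ],
            "Resource": "*"
        }
    ]
}
EOF
}

# Usage (requires IAM permissions). ROLE_NAME and the policy name are
# hypothetical placeholders, not values from this guide:
#   aws iam put-role-policy \
#     --role-name "$ROLE_NAME" \
#     --policy-name HyperPodInferenceEniAccess \
#     --policy-document "$(eni_policy_document)"
```

If the role's permissions are managed elsewhere, for example through a CloudFormation stack, add the statement there instead so the change is not overwritten by the next update.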

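**Verification:**

After you add the permissions:

1. Delete the failed endpoint, if one exists.

1. Retry the endpoint creation.

1. Monitor the deployment until it completes successfully.

**Note**  
This permission is required for SageMaker AI endpoints that run in VPC mode. Ensure that the execution role also has all other required VPC-related permissions.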

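# IAM trust relationship issue
<a name="sagemaker-hyperpod-model-deployment-ts-trust"></a>

The HyperPod inference operator fails to start with an STS `AssumeRoleWithWebIdentity` error, which indicates a problem with the IAM trust relationship configuration.

**Error message:**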

```
failed to enable inference watcher for HyperPod cluster *****: operation error SageMaker: UpdateClusterInference, 
get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, 
operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ****, 
api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```

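**Resolution:**

Update the trust relationship of the inference operator's IAM execution role with the following configuration.

Replace the placeholders as follows:
+ `<ACCOUNT_ID>`: Your AWS account ID
+ `<REGION>`: Your AWS Region
+ `<OIDC_ID>`: The OIDC provider ID of your Amazon EKS cluster
+ `<namespace>` and `<service-account-name>`: The Kubernetes namespace and service account that the inference operator uses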

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:<namespace>:<service-account-name>",
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
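
The `<OIDC_ID>` value is the last path segment of the cluster's OIDC issuer URL, which `aws eks describe-cluster` returns. The following is a minimal sketch of how the placeholder values might be derived and the trust policy applied with `aws iam update-assume-role-policy`. The helper function names, `CLUSTER_NAME`, `ROLE_NAME`, and the file name `trust-policy.json` are hypothetical and are not defined by this guide.

```shell
# Extract the OIDC provider ID (the last path segment) from an issuer URL.
oidc_id_from_issuer() {
  printf '%s\n' "${1##*/}"
}

# Build the expected `sub` claim value for a service account.
# Arguments: <namespace> <service-account-name>
service_account_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Usage (requires EKS and IAM access; CLUSTER_NAME, ROLE_NAME, and
# trust-policy.json are hypothetical placeholders):
#   issuer="$(aws eks describe-cluster --name "$CLUSTER_NAME" \
#     --query 'cluster.identity.oidc.issuer' --output text)"
#   oidc_id_from_issuer "$issuer"
#   aws iam update-assume-role-policy --role-name "$ROLE_NAME" \
#     --policy-document file://trust-policy.json
```

A mismatch between the `sub` condition and the operator's actual namespace or service account name produces the same `AccessDenied` error as a missing OIDC provider, so compare both values when the error persists.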

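**Verification:**

After you update the trust relationship:

1. Verify the role configuration in the IAM console.

1. Restart the inference operator if necessary.

1. Monitor the operator logs for a successful startup.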

# Missing NVIDIA GPU plugin error
<a name="sagemaker-hyperpod-model-deployment-ts-gpu"></a>

Model deployment fails with an insufficient GPU error even though GPU nodes are available. This occurs when the NVIDIA device plugin is not installed on the HyperPod cluster.

**Error message:**

```
0/15 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 
5 Insufficient nvidia.com/gpu. preemption: 0/15 nodes are available: 
10 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
```

**Root cause:**
+ Kubernetes cannot detect GPU resources without the NVIDIA device plugin.
+ As a result, GPU workloads fail to schedule.

**Resolution:**

Install the NVIDIA GPU plugin by running the following command.

```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/refs/tags/v0.17.1/deployments/static/nvidia-device-plugin.yml
```

**Verification steps:**

1. Check the plugin deployment status.

   ```
   kubectl get pods -n kube-system | grep nvidia-device-plugin
   ```

1. Verify that GPU resources are now visible.

   ```
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
   ```

1. Retry the model deployment.

**Note**  
Ensure that NVIDIA drivers are installed on the GPU nodes. Plugin installation is a one-time setup per cluster. Installation may require cluster administrator permissions.
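
On a large cluster, the node table from the verification step can be long. The following is a minimal sketch of a filter that reads that table and prints only the nodes that advertise no allocatable GPUs. It relies on `kubectl` printing `<none>` when a custom-columns field is absent. The function name is illustrative.

```shell
# Read the table printed by the custom-columns command above (NAME, GPU)
# and print the names of nodes that advertise no allocatable GPUs.
# kubectl prints "<none>" when the field is absent on a node.
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu | nodes_without_gpu
```

CPU-only nodes also appear in this output, which is expected. Only GPU instance types that appear in the list point to a missing plugin or driver.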

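# Inference operator fails to start
<a name="sagemaker-hyperpod-model-deployment-ts-startup"></a>

The inference operator pod fails to start and reports the following error message. This error occurs because the permission policy on the operator execution role is not authorized to perform `sts:AssumeRoleWithWebIdentity`. As a result, the part of the operator that runs on the control plane does not start.

**Error message:**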

```
Warning Unhealthy 5m46s (x22 over 49m) kubelet Startup probe failed: Get "http://10.1.100.59:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

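**Root cause:**
+ The permission policy of the inference operator execution role is not set up to access authorization tokens for resources.

**Resolution:**

Set the following policy on the execution role (`EXECUTION_ROLE_ARN`) of the HyperPod inference operator.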

```
HyperpodInferenceAccessPolicy-ml-cluster to include all resources
```

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        }
    ]
}
```

**Verification steps:**

1. Apply the policy change.

1. Terminate the HyperPod inference operator pod.

1. Confirm that the pod restarts without throwing any exceptions.
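
The verification steps above can be sketched as follows. The snippet checks pod events for the startup probe failure shown in the error message, and the usage comments show how the pod might be terminated and watched while it restarts. The namespace `hyperpod-inference-system` is an assumption taken from the `kubectl get pods` examples earlier in this guide, `<operator-pod-name>` is a placeholder, and the function name is illustrative.

```shell
# Return success if the text on stdin contains a startup probe failure event.
has_startup_probe_failure() {
  grep -q 'Startup probe failed'
}

# Usage (requires cluster access; the namespace is an assumption and
# <operator-pod-name> is a placeholder):
#   kubectl describe pod <operator-pod-name> -n hyperpod-inference-system \
#     | has_startup_probe_failure && echo "startup probe is still failing"
#
#   # Terminate the pod; its controller creates a replacement.
#   kubectl delete pod <operator-pod-name> -n hyperpod-inference-system
#
#   # Watch until the new pod reports READY 1/1 and STATUS Running.
#   kubectl get pods -n hyperpod-inference-system --watch
```

Events from earlier failures can remain in the `describe` output for some time, so check the age of the most recent `Unhealthy` event before concluding that the fix did not work.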