
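Inference operator installation failures through the SageMaker AI console

Overview: When you install the inference operator through the SageMaker AI console using Quick Install or Custom Install, the underlying CloudFormation stacks can fail for several reasons. This section covers common failure scenarios and how to resolve them.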

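Inference operator add-on installation failure through Quick Install or Custom Install

Problem: HyperPod cluster creation completes successfully, but the inference operator add-on installation fails.

Common causes:

Pod capacity limits exceeded on cluster nodes. The inference operator installation requires a minimum of 13 pods. The minimum recommended instance type is ml.c5.4xlarge.
IAM permission issues
Resource quota constraints
Network or VPC configuration problems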

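Symptoms and diagnosis

Symptoms:

The inference operator add-on shows a CREATE_FAILED or DEGRADED status in the console.
The CloudFormation stack associated with the add-on is in the CREATE_FAILED state.
Installation progress stalls or error messages are displayed.

Diagnostic steps:

Check the inference operator add-on status: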


aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query "addon.{Status:status,Health:health,Issues:issues}" \
    --output json

Check for pod limit issues:


# Check current pod count per node
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable.pods, capacity: .status.capacity.pods}'

# Count pods per node (custom-columns avoids miscounting when the RESTARTS column contains spaces)
kubectl get pods --all-namespaces --no-headers -o custom-columns=NODE:.spec.nodeName | sort | uniq -c

# Check for pod evictions or failures
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "pod\|limit\|quota"
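
The checks above can be rolled into a single rough estimate of how many pod slots remain in the cluster, which can then be compared against the 13 pods the inference operator needs. The following is a minimal sketch, not part of the official procedure: the function name `free_pod_slots` is hypothetical, and it assumes that every pod not in the Succeeded or Failed phase occupies a slot.

```shell
# Hypothetical helper: estimate free pod slots across all nodes.
# Assumption: pods in any phase other than Succeeded/Failed count against capacity.
free_pod_slots() {
    # Sum the allocatable pod count reported by every node.
    allocatable=$(kubectl get nodes \
        -o jsonpath='{range .items[*]}{.status.allocatable.pods}{"\n"}{end}' |
        awk '{ sum += $1 } END { print sum + 0 }')

    # Count pods that still occupy a slot (strip padding that some wc builds add).
    in_use=$(kubectl get pods --all-namespaces --no-headers \
        --field-selector=status.phase!=Succeeded,status.phase!=Failed |
        wc -l | tr -d ' ')

    echo $((allocatable - in_use))
}
```

For example, `[ "$(free_pod_slots)" -ge 13 ] || echo "Not enough pod capacity"` flags a cluster below the threshold. The estimate is cluster-wide, so per-node scheduling constraints can still prevent placement even when the total looks sufficient.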

Check the CloudFormation stack status (if you installed through the console):


# List CloudFormation stacks related to the cluster
aws cloudformation list-stacks \
    --region $REGION \
    --query "StackSummaries[?contains(StackName, '$EKS_CLUSTER_NAME') && StackStatus=='CREATE_FAILED'].{Name:StackName,Status:StackStatus,Reason:StackStatusReason}" \
    --output table

# Get detailed stack events
aws cloudformation describe-stack-events \
    --stack-name <stack-name> \
    --region $REGION \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED']" \
    --output table

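Resolution

To resolve the installation failure, save the current configuration, delete the failed add-on, fix the underlying issue, and then reinstall the inference operator through the SageMaker AI console (recommended) or the AWS CLI.

Step 1: Save the current configuration

Extract and save the add-on configuration before you delete the add-on: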


# Save the current configuration
aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query 'addon.configurationValues' \
    --output text > addon-config-backup.json

# Verify the configuration was saved
cat addon-config-backup.json

# Pretty print for readability
cat addon-config-backup.json | jq '.'
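
Because the add-on is deleted in the next step, it can be worth confirming that the backup actually holds a configuration. When an add-on has no `configurationValues`, the AWS CLI's text output writes the literal string `None`, which cannot be reused as a configuration later. A minimal sketch of such a guard follows; the function name `config_backup_usable` is hypothetical.

```shell
# Hypothetical guard: succeed only if the backup file holds a real configuration.
config_backup_usable() {
    file=$1

    # The file must exist and be non-empty.
    if [ ! -s "$file" ]; then
        echo "Backup is empty or missing: $file" >&2
        return 1
    fi

    # The AWS CLI prints "None" in text mode when the queried value is null.
    if [ "$(tr -d '[:space:]' < "$file")" = "None" ]; then
        echo "The add-on had no configurationValues to back up" >&2
        return 1
    fi

    return 0
}
```

A possible use is `config_backup_usable addon-config-backup.json || echo "Review the backup before deleting the add-on"`. If the configuration is JSON, `jq empty addon-config-backup.json` additionally confirms that it parses.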

Step 2: Delete the failed add-on

Delete the inference operator add-on:


aws eks delete-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION

# Wait for deletion to complete
echo "Waiting for add-on deletion..."
aws eks wait addon-deleted \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION 2>/dev/null || sleep 60

Step 3: Fix the underlying issue

Choose the resolution that matches the cause of the failure.

If the pod limit was exceeded:


# The inference operator requires a minimum of 13 pods.
# The minimum recommended instance type is ml.c5.4xlarge.
#
# Option 1: Add instance group with higher pod capacity
# Different instance types support different maximum pod counts
# For example: m5.large (29 pods), m5.xlarge (58 pods), m5.2xlarge (58 pods)
aws sagemaker update-cluster \
    --cluster-name $HYPERPOD_CLUSTER_NAME \
    --region $REGION \
    --instance-groups '[{"InstanceGroupName":"worker-group-2","InstanceType":"ml.m5.xlarge","InstanceCount":2}]'

# Option 2: Scale existing node group to add more nodes
aws eks update-nodegroup-config \
    --cluster-name $EKS_CLUSTER_NAME \
    --nodegroup-name <nodegroup-name> \
    --scaling-config minSize=2,maxSize=10,desiredSize=5 \
    --region $REGION

# Option 3: Clean up unused pods
kubectl delete pods --field-selector status.phase=Failed --all-namespaces
kubectl delete pods --field-selector status.phase=Succeeded --all-namespaces
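
The per-instance pod counts quoted in the comments above match the usual formula for the Amazon VPC CNI plugin without prefix delegation: max pods = ENIs × (IPv4 addresses per ENI − 1) + 2. The sketch below reproduces those numbers. It assumes that this networking mode applies to your nodes, the function name `max_pods` is hypothetical, and the ENI and address limits for other instance types should be taken from the Amazon EC2 documentation.

```shell
# Hypothetical helper: max pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# Assumes the Amazon VPC CNI plugin without prefix delegation.
max_pods() {
    enis=$1
    ips_per_eni=$2
    echo $((enis * (ips_per_eni - 1) + 2))
}

max_pods 3 10   # m5.large:  3 ENIs, 10 addresses each -> 29
max_pods 4 15   # m5.xlarge: 4 ENIs, 15 addresses each -> 58
```

Comparing the result with the 13-pod minimum, plus whatever system pods already run on the node, gives a quick sense of whether an instance type leaves enough headroom.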

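Step 4: Reinstall the inference operator

After you fix the underlying issue, reinstall the inference operator using one of the following methods:

SageMaker AI console with Custom Install (recommended): Reuses the existing IAM roles and TLS bucket from the previous installation. For steps, see Method 1: Install the HyperPod inference add-on through the SageMaker AI console (recommended).

AWS CLI with saved configuration: Uses the configuration you backed up in Step 1 to reinstall the add-on. For the full CLI installation steps, see Method 2: Install the inference operator using the AWS CLI.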


aws eks create-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --addon-version v1.0.0-eksbuild.1 \
    --configuration-values file://addon-config-backup.json \
    --region $REGION

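SageMaker AI console with Quick Install: Automatically creates new IAM roles, a new TLS bucket, and the dependency add-ons. For steps, see Method 1: Install the HyperPod inference add-on through the SageMaker AI console (recommended).

Step 5: Verify successful installation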


# Check add-on status
aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query "addon.{Status:status,Health:health}" \
    --output table

# Verify pods are running
kubectl get pods -n hyperpod-inference-system

# Check operator logs
kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
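
Add-on creation is asynchronous, so a single status check can still report CREATING. The following sketch polls until the add-on reaches ACTIVE or one of the failure states listed under Symptoms. The function name `wait_for_addon_active` and its timeout defaults are hypothetical. The AWS CLI also provides `aws eks wait addon-active`, which can be used instead when early failure detection is not needed.

```shell
# Hypothetical helper: poll the add-on until it is ACTIVE, fails, or times out.
# Usage: wait_for_addon_active <addon-name> [timeout-seconds] [poll-interval-seconds]
wait_for_addon_active() {
    addon=$1
    timeout=${2:-600}
    interval=${3:-15}
    elapsed=0

    while [ "$elapsed" -le "$timeout" ]; do
        status=$(aws eks describe-addon \
            --cluster-name "$EKS_CLUSTER_NAME" \
            --addon-name "$addon" \
            --region "$REGION" \
            --query 'addon.status' \
            --output text)

        case "$status" in
            ACTIVE)
                echo "$addon is ACTIVE"
                return 0 ;;
            CREATE_FAILED|DEGRADED)
                # Treated as failures, following the symptoms described earlier.
                echo "$addon is $status" >&2
                return 1 ;;
        esac

        sleep "$interval"
        elapsed=$((elapsed + interval))
    done

    echo "Timed out waiting for $addon (last status: $status)" >&2
    return 1
}
```

For example, `wait_for_addon_active amazon-sagemaker-hyperpod-inference` can run before the pod and log checks so that they are not executed against a half-created installation.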

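Cert-manager installation fails because the Kueue webhook is not ready

Problem: The cert-manager add-on installation fails with a webhook error because the task governance (Kueue) webhook service has no available endpoints. This is a race condition that occurs when cert-manager tries to create resources before the task governance webhook pods are fully running. It can happen when the task governance add-on is installed together with the inference operator during cluster creation.

Symptoms and diagnosis

Error message: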


AdmissionRequestDenied
Internal error occurred: failed calling webhook "mdeployment.kb.io": failed to call webhook: 
Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-apps-v1-deployment?timeout=10s": 
no endpoints available for service "kueue-webhook-service"

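Root cause:

The task governance add-on installs and registers a mutating webhook that intercepts all Deployment creations.
The cert-manager add-on tries to create Deployment resources before the task governance webhook pods are ready.
Kubernetes admission control calls the task governance webhook, but the webhook has no endpoints because its pods are not running yet.

Diagnostic steps:

Check the cert-manager add-on status: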


aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name cert-manager \
    --region $REGION \
    --query "addon.{Status:status,Health:health,Issues:issues}" \
    --output json

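Resolution

Solution: Delete and reinstall cert-manager

The task governance webhook becomes ready within 60 seconds. Deleting the cert-manager add-on and reinstalling it is enough to resolve the failure.

Delete the failed cert-manager add-on: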


aws eks delete-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name cert-manager \
    --region $REGION

Wait 30 to 60 seconds for the task governance webhook to become ready, then reinstall the cert-manager add-on:
```
sleep 60

aws eks create-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name cert-manager \
    --region $REGION
```
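
Instead of relying on a fixed `sleep 60`, the wait can be tied to the condition that actually caused the error: the webhook Service having no endpoints. The following is a minimal sketch under that assumption. The function name `wait_for_kueue_webhook` and its defaults are hypothetical, and the Service name and namespace are taken from the error message shown earlier.

```shell
# Hypothetical helper: wait until the Kueue webhook Service has at least one endpoint.
# Service and namespace names come from the webhook error message.
# Usage: wait_for_kueue_webhook [timeout-seconds] [poll-interval-seconds]
wait_for_kueue_webhook() {
    timeout=${1:-120}
    interval=${2:-5}
    elapsed=0

    while [ "$elapsed" -le "$timeout" ]; do
        # Prints the ready endpoint IPs; empty while no webhook pod is ready.
        ips=$(kubectl get endpoints kueue-webhook-service \
            -n kueue-system \
            -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)

        if [ -n "$ips" ]; then
            echo "kueue-webhook-service endpoints: $ips"
            return 0
        fi

        sleep "$interval"
        elapsed=$((elapsed + interval))
    done

    echo "kueue-webhook-service still has no endpoints after ${timeout}s" >&2
    return 1
}
```

Running `wait_for_kueue_webhook` before the `aws eks create-addon` call means the reinstall starts only once the webhook can answer admission requests. If the function times out, inspecting the pods in the `kueue-system` namespace is a reasonable next step.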
