
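Getting started

This guide shows how to deploy a custom Amazon Nova model to a SageMaker real-time endpoint, configure inference parameters, and invoke the model for testing.

Prerequisites

The following are the prerequisites for deploying Amazon Nova models on SageMaker inference.

Create an AWS account - If you don't already have one, create an AWS account.
Required IAM permissions - Make sure that the following managed policies are attached to your IAM user or role.
- AmazonSageMakerFullAccess
- AmazonS3FullAccess
Required SDK/CLI versions - The following SDK versions have been tested and validated with Amazon Nova models on SageMaker inference.
- SageMaker Python SDK v3.0.0 or later (sagemaker>=3.0.0) for the resource-based API approach
- Boto3 version 1.35.0 or later (boto3>=1.35.0) for direct API calls. The examples in this guide use this approach.
Service quota increase - Request an Amazon SageMaker service quota increase for the ML instance type that you plan to use for your SageMaker inference endpoint (for example, ml.p5.48xlarge for endpoint usage). For a list of supported instance types, see the Supported models and instances section. To request an increase, see Requesting a quota increase. For more information about SageMaker instance quotas, see SageMaker endpoints and quotas.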
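The minimum SDK versions listed above can be checked before running the rest of the guide. The following is a minimal sketch and not part of the tested guide: `parse_version`, `meets_minimum`, and `check_package` are hypothetical helper names, and the comparison assumes plain dotted numeric versions such as 1.35.0.

```python
from importlib import metadata


def parse_version(version):
    """Turn a dotted version string such as '1.35.0' into a tuple of ints.

    Assumption: only the leading numeric components matter; the first
    non-numeric component (for example 'dev0') ends the parse.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so that '3.0' equals '3.0.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Describe whether an installed package satisfies the minimum version."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    return f"{name} {installed}: {status} (need >= {minimum})"


if __name__ == "__main__":
    print(check_package("boto3", "1.35.0"))
    print(check_package("sagemaker", "3.0.0"))
```

Only the Boto3 check matters for the examples in this guide, because they use direct API calls.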

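Tip

For a quick end-to-end deployment, you can run the custom Nova model SageMaker inference notebook, which deploys a custom Amazon Nova model to SageMaker inference from a single notebook.

Step 1: Configure AWS credentials

Configure your AWS credentials using one of the following methods.

Option 1: AWS CLI (recommended)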


aws configure

When prompted, enter your AWS access key ID, secret access key, and default Region.

Option 2: AWS credentials file

Create or edit ~/.aws/credentials.


[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

Option 3: Environment variables


export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

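Note

For more information about AWS credentials, see Configuration and credential file settings.

Initialize AWS clients

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials.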


import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts')

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")

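If authentication succeeds, the output confirms your AWS account ID.

Step 2: Create a SageMaker execution role

A SageMaker execution role is an IAM role that grants SageMaker permission to access AWS resources on your behalf, such as Amazon S3 buckets for model artifacts and CloudWatch for logging.

Create an execution role

Note

Creating an IAM role requires the iam:CreateRole and iam:AttachRolePolicy permissions. Make sure that your IAM user or role has these permissions before you continue.

The following code creates an IAM role with the permissions needed to deploy Amazon Nova custom models.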


import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role
role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='SageMaker execution role with S3 and SageMaker access'
)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")

Use an existing execution role (optional)

If you already have a SageMaker execution role, you can use it instead.


# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"

To find existing SageMaker roles in your account:


iam = boto3.client('iam', region_name=REGION)
# list_roles returns results in pages; a paginator walks every page so that
# roles beyond the first page are not missed
paginator = iam.get_paginator('list_roles')
for page in paginator.paginate():
    for role in page['Roles']:
        if 'SageMaker' in role['RoleName']:
            print(f"{role['RoleName']}: {role['Arn']}")

Important

The execution role must have permissions to access Amazon S3 and SageMaker resources, and a trust relationship with sagemaker.amazonaws.com.

For more information about SageMaker execution roles, see SageMaker Roles.
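The trust relationship requirement can be checked programmatically for an existing role. The sketch below is illustrative only: `trusts_sagemaker` is a hypothetical helper, and it assumes the trust policy is passed as a Python dict, which is how Boto3 returns `AssumeRolePolicyDocument` from `iam.get_role()`. It does not evaluate policy conditions.

```python
def trusts_sagemaker(assume_role_policy):
    """Return True if the trust policy lets sagemaker.amazonaws.com assume the role.

    Assumption: the policy is a dict with a 'Statement' entry. Condition
    blocks are ignored by this sketch.
    """
    statements = assume_role_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):  # for example the wildcard "*"
            continue
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if "sagemaker.amazonaws.com" in services:
            return True
    return False


if __name__ == "__main__":
    example_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    print(trusts_sagemaker(example_policy))
```

With a live IAM client, the policy for a role could be fetched with `iam.get_role(RoleName=...)['Role']['AssumeRolePolicyDocument']` and passed to this helper.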

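Step 3: Configure model parameters

Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For the list of supported instance types and the CONTEXT_LENGTH and MAX_CONCURRENCY values supported for each, see the Supported models and instances section. For the full list of additional container features, such as sampling defaults, speculative decoding, and quantization, see the Inference container features section.

Required parameters

IMAGE: The Docker container image URI for the Amazon Nova inference container. This is provided by AWS.
CONTEXT_LENGTH: The model context length.
MAX_CONCURRENCY: The maximum number of sequences per iteration. Sets the limit on how many individual user requests (prompts) can be processed concurrently within a single batch on the GPU. Range: an integer greater than 0.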
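A small pre-deployment check on the two numeric parameters above can catch typos before an endpoint is created. This is a minimal sketch, not a documented validation routine: `validate_environment` is a hypothetical helper, it assumes the values are supplied as strings in a dict of container environment variables, and applying the "greater than 0" range to CONTEXT_LENGTH is an assumption (the guide states that range only for MAX_CONCURRENCY). It does not check the per-instance limits from the Supported models and instances section.

```python
def validate_environment(environment):
    """Return a list of problems found in the container environment dict.

    Assumptions: CONTEXT_LENGTH and MAX_CONCURRENCY are string values that
    must parse as integers greater than 0.
    """
    problems = []
    for key in ("CONTEXT_LENGTH", "MAX_CONCURRENCY"):
        value = environment.get(key)
        if value is None:
            problems.append(f"{key} is missing")
            continue
        if not isinstance(value, str):
            problems.append(f"{key} must be a string, got {type(value).__name__}")
            continue
        try:
            number = int(value)
        except ValueError:
            problems.append(f"{key}={value!r} is not an integer")
            continue
        if number <= 0:
            problems.append(f"{key}={value!r} must be greater than 0")
    return problems


if __name__ == "__main__":
    for problem in validate_environment({"CONTEXT_LENGTH": "8000", "MAX_CONCURRENCY": "0"}):
        print(problem)
```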

Deployment configuration


# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Required parameters
CONTEXT_LENGTH = "8000"        # Maximum total context length
MAX_CONCURRENCY = "8"          # Maximum concurrent sequences

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
    # Optional: add container feature environment variables here.
    # See "Inference Container Features" for the full list.
    # Examples:
    # 'DEFAULT_TEMPERATURE': '0.7',
    # 'DEFAULT_MAX_NEW_TOKENS': '512',
    # 'QUANTIZATION_DTYPE': 'fp8',
}

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")

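Configure deployment-specific parameters

Next, configure the parameters specific to your Amazon Nova model deployment, including the model artifact location and the instance type.

Set the deployment identifier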


# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"

Specify the model artifact location

Provide the Amazon S3 URI where your trained Amazon Nova model artifacts are stored. This should be the output location of your model training or fine-tuning job.


# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
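Because the URI must end with a slash, it can help to validate it before it is passed to SageMaker. The following sketch is illustrative: `check_model_s3_location` is a hypothetical helper that only inspects the string and does not contact Amazon S3.

```python
from urllib.parse import urlparse


def check_model_s3_location(uri):
    """Split an S3 URI into (bucket, prefix) and enforce the trailing slash.

    Raises ValueError when the URI does not look like s3://bucket/prefix/.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    if not uri.endswith("/"):
        raise ValueError(f"S3 URI must end with '/': {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    bucket, prefix = check_model_s3_location(
        "s3://your-bucket-name/path/to/model/artifacts/"
    )
    print(bucket, prefix)
```

The returned bucket and prefix could then be used with an S3 client call such as `list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)` to confirm that at least one object exists under the prefix.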

Select the model variant and instance type


# Configure model variant and instance type
TESTCASE = {
    "model": "micro",              # Options: micro, lite, lite2
    "instance": "ml.g5.12xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")

Naming convention

The code automatically generates consistent names for your AWS resources.

Model name: {JOB_NAME}-{model}-{instance-type}
Endpoint configuration: {MODEL_NAME}-Config
Endpoint name: {MODEL_NAME}-Endpoint
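Generated names grow with the job name, so a long JOB_NAME can produce names that SageMaker rejects. The sketch below rebuilds the names from the convention above and flags invalid ones. It is a sketch under stated assumptions: `build_resource_names` and `invalid_names` are hypothetical helpers, and the 63-character limit and the allowed-character pattern are assumptions about SageMaker resource names that should be confirmed against the API reference.

```python
import re

# Assumption: names may contain alphanumerics and hyphens, must start and
# end with an alphanumeric, and may be at most 63 characters long.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
MAX_NAME_LENGTH = 63


def build_resource_names(job_name, model, instance_type):
    """Build the model, endpoint config, and endpoint names used in this guide."""
    model_name = f"{job_name}-{model}-{instance_type.replace('.', '-')}"
    return {
        "model": model_name,
        "endpoint_config": model_name + "-Config",
        "endpoint": model_name + "-Endpoint",
    }


def invalid_names(names):
    """Return the names that are too long or contain disallowed characters."""
    return sorted(
        name
        for name in names.values()
        if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.fullmatch(name)
    )


if __name__ == "__main__":
    names = build_resource_names("my-nova-deployment", "micro", "ml.g5.12xlarge")
    print(names)
    print("Invalid:", invalid_names(names))
```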

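Step 4: Create SageMaker resources and deploy the endpoint

SageMaker supports two approaches for deploying a model to a real-time endpoint. Choose the approach that fits your use case:

Inference components (recommended): Deploy the model to an endpoint as an inference component. With this approach you can host multiple models on a single endpoint, scale models independently, and optimize resource utilization.
Single model endpoint: Deploy a single model directly to an endpoint using a model object and an endpoint configuration. This approach is simpler to set up and suits development, testing, or workloads that need only one model per endpoint.

Option A: Create with inference components

With inference components, you first create the endpoint and then deploy the model to it as an inference component. This decouples the model from the endpoint infrastructure, which gives you more flexibility.

Create the endpoint configuration

Create an endpoint configuration that defines the infrastructure without specifying a model. The instance type and count are managed at the endpoint level.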


# Create Endpoint Configuration for inference components
INFERENCE_COMPONENT_NAME = MODEL_NAME + "-IC"

try:
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[
            {
                'VariantName': 'primary',
                'InstanceType': INSTANCE_TYPE,
                'InitialInstanceCount': 1,
                'RoutingConfig': {
                    'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
                }
            }
        ],
        Tags=[
            {
                'Key': 'sagemaker:nova-inference-component',
                'Value': 'true'
            }
        ]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")

except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")

Create and deploy the endpoint


import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")

# Wait for endpoint to be InService
print("Waiting for endpoint to be InService...")
print("This typically takes 5-10 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print(f"\nEndpoint '{ENDPOINT_NAME}' is ready.")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            break
        else:
            print(f"Status: {status}")
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)
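The polling loop above runs until the endpoint reaches InService or Failed, with no upper bound on the wait. One possible variation adds a timeout and takes the describe call as a parameter so the same loop can serve endpoints and inference components. This is a sketch, not the guide's tested code: `wait_for_status` and its parameters are hypothetical, and the 30-minute default timeout is an arbitrary choice.

```python
import time


def wait_for_status(describe, status_key, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll describe() until the status is InService or Failed, or time runs out.

    describe   -- zero-argument callable returning a describe_* response dict
    status_key -- for example 'EndpointStatus' or 'InferenceComponentStatus'
    sleep      -- injectable so the loop can be exercised without real waiting
    Returns the final response dict; raises TimeoutError on timeout.
    """
    waited = 0
    while True:
        response = describe()
        status = response[status_key]
        print(f"Status: {status}")
        if status in ("InService", "Failed"):
            return response
        if waited >= timeout_s:
            raise TimeoutError(f"Still {status} after {waited} seconds")
        sleep(poll_s)
        waited += poll_s


if __name__ == "__main__":
    # Simulated responses stand in for sagemaker.describe_endpoint here
    fake = iter([{"EndpointStatus": "Creating"}, {"EndpointStatus": "InService"}])
    wait_for_status(lambda: next(fake), "EndpointStatus", sleep=lambda s: None)
```

With a real client, the call would look like `wait_for_status(lambda: sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME), 'EndpointStatus')`. Boto3 also provides a built-in `endpoint_in_service` waiter for endpoints, which is an alternative when custom status output is not needed.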

Create the inference component

Once the endpoint is InService, deploy your Amazon Nova model as an inference component.


try:
    ic_response = sagemaker.create_inference_component(
        InferenceComponentName=INFERENCE_COMPONENT_NAME,
        EndpointName=ENDPOINT_NAME,
        VariantName='primary',
        Specification={
            'Container': {
                'Image': IMAGE,
                'ArtifactUrl': MODEL_S3_LOCATION,
                'Environment': environment
            },
            'ComputeResourceRequirements': {
                'NumberOfCpuCoresRequired': 15,
                'NumberOfAcceleratorDevicesRequired': 4,
                'MinMemoryRequiredInMb': 25000
            }
        },
        RuntimeConfig={
            'CopyCount': 1
        }
    )
    print("Inference component creation initiated!")
    print(f"Inference Component ARN: {ic_response['InferenceComponentArn']}")

except sagemaker.exceptions.ClientError as e:
    print(f"Error creating inference component: {e}")

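Key parameters:

InferenceComponentName: A unique identifier for the inference component
EndpointName: The endpoint to deploy the component to
Image: The Docker container image URI for Amazon Nova inference
ArtifactUrl: The Amazon S3 location of the model artifacts
Environment: The environment variables configured in Step 3
NumberOfCpuCoresRequired: The number of CPU cores required per model copy
NumberOfAcceleratorDevicesRequired: The number of accelerator devices (GPUs) required per model copy
MinMemoryRequiredInMb: The minimum memory (in MB) required per model copy
CopyCount: The number of model copies to deploy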
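Because the compute requirements apply per model copy, the resources of one instance put a ceiling on CopyCount. The following back-of-the-envelope sketch estimates that ceiling. `max_copies` is a hypothetical helper, and the instance figures in the example (48 vCPUs, 4 GPUs, 192 GiB of memory for ml.g5.12xlarge) are assumptions to verify against the instance documentation. SageMaker reserves part of each instance for its own use, so the real limit may be lower than this estimate.

```python
def max_copies(instance, requirements):
    """Estimate how many model copies fit on one instance.

    instance     -- dict with 'cpu_cores', 'accelerators', 'memory_mb'
    requirements -- a ComputeResourceRequirements dict as used in
                    create_inference_component
    The scarcest resource determines the result.
    """
    limits = [
        instance["cpu_cores"] // requirements["NumberOfCpuCoresRequired"],
        instance["accelerators"] // requirements["NumberOfAcceleratorDevicesRequired"],
        instance["memory_mb"] // requirements["MinMemoryRequiredInMb"],
    ]
    return min(limits)


if __name__ == "__main__":
    # Assumed hardware figures for ml.g5.12xlarge (not taken from this guide)
    g5_12xlarge = {"cpu_cores": 48, "accelerators": 4, "memory_mb": 192 * 1024}
    requirements = {
        "NumberOfCpuCoresRequired": 15,
        "NumberOfAcceleratorDevicesRequired": 4,
        "MinMemoryRequiredInMb": 25000,
    }
    print(max_copies(g5_12xlarge, requirements))
```

Under these assumptions the four accelerators are the limiting resource, which is consistent with the CopyCount of 1 used in the request.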

Monitor the inference component deployment


# Wait for inference component to be InService
print("Waiting for inference component deployment...")
print("This typically takes 10-20 minutes as the model is loaded...\n")

while True:
    try:
        ic_desc = sagemaker.describe_inference_component(
            InferenceComponentName=INFERENCE_COMPONENT_NAME
        )
        ic_status = ic_desc['InferenceComponentStatus']
        
        if ic_status == 'Creating':
            print(f"⏳ Status: {ic_status} - Loading model artifacts...")
        elif ic_status == 'InService':
            print(f"✅ Status: {ic_status}")
            print(f"\nInference component '{INFERENCE_COMPONENT_NAME}' is ready!")
            break
        elif ic_status == 'Failed':
            print(f"❌ Status: {ic_status}")
            print(f"Failure Reason: {ic_desc.get('FailureReason', 'Unknown')}")
            break
        else:
            print(f"Status: {ic_status}")
    except Exception as e:
        print(f"Error checking inference component status: {e}")
        break
    
    time.sleep(30)

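Note

When you invoke the endpoint in Step 5, you must include the InferenceComponentName parameter in the call. See Step 5 for details.

Option B: Create as a single model endpoint

With a single model endpoint, you create a SageMaker model object and an endpoint configuration, and then deploy the endpoint. This approach packages the model directly into the endpoint configuration.

Create the SageMaker model

The following code creates a SageMaker model that references your Amazon Nova model artifacts.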


try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")

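Key parameters:

ModelName: A unique identifier for the model
Image: The Docker container image URI for Amazon Nova inference
ModelDataSource: The Amazon S3 location of the model artifacts
Environment: The environment variables configured in Step 3
ExecutionRoleArn: The IAM role from Step 2
EnableNetworkIsolation: Set to True for enhanced security (prevents the container from making outbound network calls)

Create the endpoint configuration

Next, create an endpoint configuration that defines the deployment infrastructure.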


# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")

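Key parameters:

VariantName: An identifier for this model variant (use 'primary' for single model deployments)
ModelName: References the model created above
InitialInstanceCount: The number of instances to deploy (start with 1 and scale later if needed)
InstanceType: The ML instance type selected in Step 3

Deploy the endpoint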


import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")

Monitor endpoint creation

The following code polls the endpoint status until deployment completes.


# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds

Verify resource creation

You can verify that the resources were created successfully.


# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")

Verify that the endpoint is ready

Regardless of the approach you chose, you can review the endpoint configuration.


# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break

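Troubleshooting endpoint creation failures

Common failure reasons:

Insufficient capacity: The requested instance type is not available in your Region
- Resolution: Try a different instance type or request a quota increase
IAM permissions: The execution role lacks the required permissions
- Resolution: Verify that the role has access to the Amazon S3 model artifacts and the required SageMaker permissions
Model artifacts not found: The Amazon S3 URI is incorrect or inaccessible
- Resolution: Verify the Amazon S3 URI, check the bucket permissions, and confirm that the bucket is in the correct Region
Resource limits: Account limits for endpoints or instances have been exceeded
- Resolution: Request a service quota increase through Service Quotas or AWS Support

Note

If you need to delete a failed endpoint and start over: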


sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)

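Step 5: Invoke the endpoint

Once the endpoint is InService, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports synchronous endpoints (real time, in streaming or non-streaming mode) and asynchronous endpoints (Amazon S3 based, for batch processing).

Set up the runtime client

Create a SageMaker Runtime client with appropriate timeout settings.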


import json
import boto3
import botocore
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = botocore.config.Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)

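Create a general-purpose inference function

The following function handles both streaming and non-streaming requests. It uses the INFERENCE_COMPONENT_NAME variable defined in Step 4. If you deployed with inference components (Option A), the variable is already set to MODEL_NAME + "-IC". If you deployed a single model endpoint (Option B), the variable is not defined, so set it to None before you run this step.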


# Only needed if you followed Option B (single model endpoints) in Step 4:
# INFERENCE_COMPONENT_NAME = None

def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    Supports both inference component and single model endpoint deployments.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    # Build invoke parameters
    invoke_params = {
        'EndpointName': ENDPOINT_NAME,
        'ContentType': 'application/json',
        'Body': body
    }
    
    # Add InferenceComponentName if using inference components
    if INFERENCE_COMPONENT_NAME:
        invoke_params['InferenceComponentName'] = INFERENCE_COMPONENT_NAME
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(**invoke_params)
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            invoke_params['Accept'] = 'application/json'
            response = runtime_client.invoke_endpoint(**invoke_params)
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")

Example 1: Non-streaming chat completion

Use the chat format for conversational interactions.


# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)

Sample response:


{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}
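The fields most callers need from a chat response are the reply text, the finish reason, and the token usage. The sketch below reads them from a response dict shaped like the sample above. `summarize_chat_response` is a hypothetical helper, and it assumes the response contains at least one choice. The inference function in this guide returns None when a request fails, so the helper handles that case.

```python
def summarize_chat_response(response):
    """Extract reply text, finish reason, and token usage from a chat response.

    Returns None when the response is None, for example after a failed request.
    """
    if response is None:
        return None
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


if __name__ == "__main__":
    sample = {
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Hello!"},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 12, "completion_tokens": 28, "total_tokens": 40},
    }
    print(summarize_chat_response(sample))
```

A finish reason other than "stop" can indicate that the reply was cut off, for instance by the token limit in the request.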

Example 2: Simple text completion

Use the completion format for simple text generation.


# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)

Sample response:


{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
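Log probabilities are easier to interpret after conversion to probabilities with exp(logprob). The sketch below does this for the completion-format logprobs shown in the sample response. `top_alternatives` is a hypothetical helper that assumes the `tokens` and `top_logprobs` lists are aligned position by position, as they are in the sample.

```python
import math


def top_alternatives(response):
    """For each generated token, list candidate tokens with their probabilities.

    Returns a list of (generated_token, [(candidate, probability), ...]) with
    candidates ordered from most to least likely.
    """
    logprobs = response["choices"][0]["logprobs"]
    rows = []
    for token, candidates in zip(logprobs["tokens"], logprobs["top_logprobs"]):
        ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
        # exp() converts a natural-log probability back into a probability
        rows.append((token, [(cand, math.exp(lp)) for cand, lp in ranked]))
    return rows


if __name__ == "__main__":
    sample = {
        "choices": [
            {
                "text": " Paris.",
                "logprobs": {
                    "tokens": [" Paris", "."],
                    "token_logprobs": [-0.001, -0.002],
                    "top_logprobs": [
                        {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                        {".": -0.002, ",": -4.567, "!": -7.890},
                    ],
                },
            }
        ]
    }
    for token, candidates in top_alternatives(sample):
        print(repr(token), [(c, round(p, 4)) for c, p in candidates])
```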

Example 3: Streaming chat completion


# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)

Sample streaming output:


Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once"},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon"},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a"},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time"},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
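The raw chunks above can be reduced to plain generated text by parsing each event. The sketch below is one possible approach, not the guide's tested code. It assumes, based on the sample output, that each event is a newline-terminated line of the form `data: <json>` and that the stream ends with `data: [DONE]`. `parse_event_line` and `iter_stream_text` are hypothetical helpers. Bytes are buffered until a full line is available, so an event or a multi-byte character split across two chunks is still decoded correctly.

```python
import json


def parse_event_line(line):
    """Return (done, texts) for one raw line of the stream."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return False, []
    payload = text[len("data:"):].strip()
    if payload == "[DONE]":
        return True, []
    event = json.loads(payload)
    texts = [
        choice["delta"]["content"]
        for choice in event.get("choices", [])
        if choice.get("delta", {}).get("content")
    ]
    return False, texts


def iter_stream_text(chunks):
    """Yield text deltas from an iterable of raw byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            done, texts = parse_event_line(line)
            yield from texts
            if done:
                return
    # Handle a final line that was not newline-terminated
    _, texts = parse_event_line(buffer)
    yield from texts


if __name__ == "__main__":
    demo = [
        b'data: {"choices":[{"index":0,"delta":{"content":" Once"}}]}\n\n',
        b'data: {"choices":[{"index":0,"delta":{"content":" upon"}}]}\n\n',
        b"data: [DONE]\n\n",
    ]
    print("".join(iter_stream_text(demo)))
```

With a live response, the chunks could be supplied as `(event['PayloadPart']['Bytes'] for event in response['Body'] if 'PayloadPart' in event)`.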

Example 4: Multimodal chat completion

Use the multimodal format for image and text inputs.


# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)
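The `url` field in the request above carries the image as a base64 data URL. The following sketch shows one way to build such a URL from a local file. `bytes_to_data_url` and `image_file_to_data_url` are hypothetical helpers, the MIME type is guessed from the file extension, and any limits on image size or supported formats for your model are not covered here.

```python
import base64
import mimetypes


def bytes_to_data_url(data, mime_type):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(path):
    """Read an image file and return it as a data URL.

    Raises ValueError if the extension does not map to an image MIME type.
    """
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")
    with open(path, "rb") as image_file:
        return bytes_to_data_url(image_file.read(), mime_type)


if __name__ == "__main__":
    # The first three bytes of a JPEG file, used only to show the output format
    print(bytes_to_data_url(b"\xff\xd8\xff", "image/jpeg"))
```

The returned string can be placed in the `url` field of the `image_url` content item.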

Sample response:


{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}

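Step 6: Clean up resources (optional)

To avoid unnecessary charges, delete the AWS resources you created during this tutorial. SageMaker endpoints incur charges while they are running, even when you are not actively sending inference requests.

Important

Deleting resources is permanent and cannot be undone. Before you continue, confirm that you no longer need these resources.

Initialize the cleanup client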


import boto3
import time

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

Delete the inference component (if you used Option A)

If you deployed with inference components, delete the inference component first, before you delete the endpoint.


# Delete inference component (Option A only)
try:
    print("Deleting inference component...")
    sagemaker.delete_inference_component(InferenceComponentName=INFERENCE_COMPONENT_NAME)
    print(f"✅ Inference component '{INFERENCE_COMPONENT_NAME}' deletion initiated")
except Exception as e:
    print(f"❌ Error deleting inference component: {e}")

# Wait for inference component to be deleted before proceeding
print("Waiting for inference component deletion...")
while True:
    try:
        sagemaker.describe_inference_component(InferenceComponentName=INFERENCE_COMPONENT_NAME)
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Inference component successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break

Delete the endpoint


try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")

Note

Endpoint deletion is asynchronous. You can monitor the deletion status.


import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break

Delete the endpoint configuration

After the endpoint is deleted, remove the endpoint configuration.


try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")

Delete the model (Option B only)

If you used a single model endpoint, remove the SageMaker model object.


try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
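The deletion order in this step depends on which deployment option was used. The sketch below expresses that order as data so it can be reviewed before anything is deleted. `cleanup_plan` is a hypothetical helper; the method names are the SageMaker client calls used in this step, and the function itself makes no AWS calls.

```python
def cleanup_plan(option, endpoint_name, endpoint_config_name,
                 inference_component_name=None, model_name=None):
    """Return the delete calls for this guide's resources, in dependency order.

    option -- 'A' (inference components) or 'B' (single model endpoint)
    Each entry is (client_method_name, keyword_arguments).
    """
    if option not in ("A", "B"):
        raise ValueError("option must be 'A' or 'B'")
    plan = []
    if option == "A":
        # The inference component must be removed before its endpoint
        plan.append(("delete_inference_component",
                     {"InferenceComponentName": inference_component_name}))
    plan.append(("delete_endpoint", {"EndpointName": endpoint_name}))
    plan.append(("delete_endpoint_config",
                 {"EndpointConfigName": endpoint_config_name}))
    if option == "B":
        plan.append(("delete_model", {"ModelName": model_name}))
    return plan


if __name__ == "__main__":
    for method, kwargs in cleanup_plan("A", "my-endpoint", "my-config",
                                       inference_component_name="my-ic"):
        print(method, kwargs)
```

Each entry could be executed with `getattr(sagemaker, method)(**kwargs)`. Because the inference component and endpoint deletions are asynchronous, wait for each one to finish, as shown earlier in this step, before running the next entry.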
