# SageMaker Inference
<a name="nova-model-sagemaker-inference"></a>

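Custom Amazon Nova models are now available on SageMaker Inference. With Amazon Nova on SageMaker, you can get predictions, or inferences, from your trained custom Amazon Nova models. SageMaker offers a broad selection of ML infrastructure and model deployment options to help meet your ML inference needs. With SageMaker Inference, you can scale your model deployment, manage models more effectively in production, and reduce operational burden.

SageMaker provides several inference options, such as real-time endpoints for low-latency inference and asynchronous endpoints for batches of requests. Choosing the inference option that fits your use case helps you deploy models and run inference efficiently. For more information about SageMaker Inference, see [Deploy models for inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html).

**Important**  
Only full-rank custom models and LoRA-merged models are supported on SageMaker Inference. For unmerged LoRA models and base models, use Amazon Bedrock.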

## Features
<a name="nova-sagemaker-inference-features"></a>

The following features are available for Amazon Nova models on SageMaker Inference.

**Model capabilities**
+ Text generation

**Deployment and scaling**
+ Real-time endpoints with custom instance selection
+ Auto scaling - automatically adjusts capacity based on traffic patterns to optimize cost and GPU utilization. For more information, see [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com//sagemaker/latest/dg/endpoint-auto-scaling.html).
+ Streaming API support for real-time token generation
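
As a rough sketch of the auto scaling option, the helper below builds the two request payloads that Application Auto Scaling expects for a SageMaker endpoint variant. The function name, the capacity bounds, and the target of 5 invocations per instance are illustrative assumptions, not recommendations from this guide; `primary` is the variant name used in the getting started steps.

```python
def build_autoscaling_requests(endpoint_name, variant_name="primary",
                               min_capacity=1, max_capacity=2,
                               target_invocations=5.0):
    """Build kwargs for register_scalable_target and put_scaling_policy.

    Hypothetical helper: the capacity bounds and target value are placeholders.
    """
    if min_capacity < 1 or max_capacity < min_capacity:
        raise ValueError("need 1 <= min_capacity <= max_capacity")

    # Fields shared by both Application Auto Scaling calls
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_capacity, MaxCapacity=max_capacity)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-invocations-target",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    )
    return target, policy


# Usage (requires AWS credentials and an InService endpoint):
#   import boto3
#   autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")
#   target, policy = build_autoscaling_requests("my-endpoint")
#   autoscaling.register_scalable_target(**target)
#   autoscaling.put_scaling_policy(**policy)
```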

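**Monitoring and optimization**
+ Amazon CloudWatch integration for monitoring and alerts
+ Availability Zone-aware latency optimization through VPC configuration

**Developer tools**
+ AWS CLI support - for more information, see [AWS CLI Command Reference for SageMaker](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/).
+ Notebook integration through SDK support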

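## Supported models and instances
<a name="nova-sagemaker-inference-supported"></a>

When you create a SageMaker Inference endpoint, you can configure the deployment by setting two environment variables, `CONTEXT_LENGTH` and `MAX_CONCURRENCY`.
+ `CONTEXT_LENGTH` - the maximum total token length per request (input + output)
+ `MAX_CONCURRENCY` - the maximum number of concurrent requests that the endpoint serves

The following table lists the supported Amazon Nova models, instance types, and configurations. The `MAX_CONCURRENCY` values are the maximum concurrency supported at each `CONTEXT_LENGTH` setting.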


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/ko_kr/nova/latest/nova2-userguide/nova-model-sagemaker-inference.html)

**Note**  
The `MAX_CONCURRENCY` values shown are upper bounds for each `CONTEXT_LENGTH` setting. You can use a shorter context length with the same concurrency, but exceeding these values causes SageMaker endpoint creation to fail.  
For example, for Amazon Nova Micro on ml.g5.12xlarge:
+ `CONTEXT_LENGTH=2000`, `MAX_CONCURRENCY=32` → valid
+ `CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=32` → rejected (the concurrency limit at a context length of 8,000 is 16)
+ `CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=4` → valid
+ `CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=16` → valid
+ `CONTEXT_LENGTH=10000` → rejected (maximum context length on this instance: 8,000)
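
The rule in this note can be sketched as a small pre-flight check. The limits below contain only the two data points implied by the example (a concurrency of 32 at a context length of 2,000 and 16 at 8,000 for Amazon Nova Micro on ml.g5.12xlarge); the table layout and the function name are hypothetical, and the real limits come from the supported configurations table.

```python
# Illustrative limits only: (context length, max concurrency) pairs taken from
# the example in this note, not the full supported-configurations table.
EXAMPLE_LIMITS = [(2000, 32), (8000, 16)]


def check_config(context_length, max_concurrency, limits=EXAMPLE_LIMITS):
    """Return (ok, reason) for a CONTEXT_LENGTH / MAX_CONCURRENCY pair."""
    if context_length < 1 or max_concurrency < 1:
        return False, "both values must be positive integers"
    rows = sorted(limits)
    largest_context = rows[-1][0]
    if context_length > largest_context:
        return False, f"max context length on this instance: {largest_context}"
    # A shorter context may reuse the limit of the next larger listed setting.
    allowed = next(conc for ctx, conc in rows if ctx >= context_length)
    if max_concurrency > allowed:
        return False, f"concurrency limit at this context length: {allowed}"
    return True, "valid"
```

Running a check like this before `create_endpoint` can save a failed deployment cycle, provided the limits are kept in sync with the published table.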

## Supported AWS Regions
<a name="nova-sagemaker-inference-regions"></a>

The following table lists the AWS Regions where Amazon Nova models are available on SageMaker Inference.

| Region name | Region code | Availability | 
| --- | --- | --- | 
| US East (N. Virginia) | us-east-1 | Available | 
| US West (Oregon) | us-west-2 | Available | 

## Supported container images
<a name="nova-sagemaker-inference-container-images"></a>

The following table lists the container image URIs for Amazon Nova models on SageMaker Inference, by Region. Two image tags are available for each Region: a versioned tag (`v1.0.0`) and a latest tag (`SM-Inference-latest`). For production deployments, we recommend the versioned tag.

| Region | Container image URI | 
| --- | --- | 
| us-east-1 | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-inference-repo:SM-Inference-latest | 
| us-west-2 | 176779409107.dkr.ecr.us-west-2.amazonaws.com/nova-inference-repo:SM-Inference-latest | 

## Best practices
<a name="nova-sagemaker-inference-best-practices"></a>

For best practices for deploying and managing models on SageMaker, see [Best Practices for SageMaker](https://docs.aws.amazon.com//sagemaker/latest/dg/best-practices.html).

## Support
<a name="nova-sagemaker-inference-support"></a>

For support and issues with Amazon Nova models on SageMaker Inference, contact AWS Support through the console or your AWS account manager.

**Topics**
+ [Features](#nova-sagemaker-inference-features)
+ [Supported models and instances](#nova-sagemaker-inference-supported)
+ [Supported AWS Regions](#nova-sagemaker-inference-regions)
+ [Supported container images](#nova-sagemaker-inference-container-images)
+ [Best practices](#nova-sagemaker-inference-best-practices)
+ [Support](#nova-sagemaker-inference-support)
+ [Getting started](nova-sagemaker-inference-getting-started.md)
+ [API reference](nova-sagemaker-inference-api-reference.md)
+ [Evaluating models hosted on SageMaker Inference](nova-eval-on-sagemaker-inference.md)
+ [Abuse detection for Amazon Nova Forge models deployed on Amazon SageMaker Inference](nova-sagemaker-inference-abuse-detection.md)

# Getting started
<a name="nova-sagemaker-inference-getting-started"></a>

This guide shows how to deploy a customized Amazon Nova model to a SageMaker real-time endpoint, configure inference parameters, and invoke the model for testing.

## Prerequisites
<a name="nova-sagemaker-inference-prerequisites"></a>

The following are the prerequisites for deploying Amazon Nova models on SageMaker Inference:
+ An AWS account - if you don't have one, [create an AWS account](https://docs.aws.amazon.com//sagemaker/latest/dg/gs-set-up.html#sign-up-for-aws).
+ Required IAM permissions - make sure that your IAM user or role has the following managed policies attached:
  + `AmazonSageMakerFullAccess`
  + `AmazonS3FullAccess`
+ Required SDK/CLI versions - the following SDK versions have been tested and validated with Amazon Nova models on SageMaker Inference:
  + SageMaker Python SDK v3.0.0 or later (`sagemaker>=3.0.0`) for the resource-based API approach
  + Boto3 version 1.35.0 or later (`boto3>=1.35.0`) for direct API calls. The examples in this guide use this approach.
+ Service quota increase - request an Amazon SageMaker service quota increase for the ML instance type that you plan to use for your SageMaker Inference endpoint (for example, `ml.p5.48xlarge for endpoint usage`). For the list of supported instance types, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported). To request an increase, see [Requesting a quota increase](https://docs.aws.amazon.com//servicequotas/latest/userguide/request-quota-increase.html). For more information about SageMaker instance quotas, see [SageMaker endpoints and quotas](https://docs.aws.amazon.com//general/latest/gr/sagemaker.html).
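
One way to confirm the SDK version prerequisites is a short script like the following. It is a minimal sketch that uses only the standard library; the helper names are hypothetical, and the comparison looks only at the leading numeric parts of each version string.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.35.0' into (1, 35, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare dotted version strings numerically, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum):
    """Return a human-readable status line for one package."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed (need >= {minimum})"
    status = "OK" if meets_minimum(installed, minimum) else f"upgrade to >= {minimum}"
    return f"{name} {installed}: {status}"


# Minimum versions listed in the prerequisites above
for package, minimum in [("boto3", "1.35.0"), ("sagemaker", "3.0.0")]:
    print(check_package(package, minimum))
```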

## Step 1: Configure AWS credentials
<a name="nova-sagemaker-inference-step1"></a>

Configure your AWS credentials using one of the following methods.

**Option 1: AWS CLI (recommended)**

```
aws configure
```

When prompted, enter your AWS access key ID, secret access key, and default Region.

**Option 2: AWS credentials file**

Create or edit `~/.aws/credentials`.

```
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

**Option 3: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
```

**Note**  
For more information about AWS credentials, see [Configuration and credential file settings](https://docs.aws.amazon.com//cli/latest/userguide/cli-configure-files.html).

**Initialize AWS clients**

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials.

```
import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts')

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")
```

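If authentication succeeds, the output confirms your AWS account ID.

## Step 2: Create a SageMaker execution role
<a name="nova-sagemaker-inference-step2"></a>

A SageMaker execution role is an IAM role that grants SageMaker permission to access AWS resources on your behalf, such as Amazon S3 buckets for model artifacts and CloudWatch for logging.

**Create the execution role**

**Note**  
Creating IAM roles requires the `iam:CreateRole` and `iam:AttachRolePolicy` permissions. Make sure that your IAM user or role has these permissions before you proceed.

The following code creates an IAM role with the permissions required to deploy customized Amazon Nova models.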

```
import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role
role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='SageMaker execution role with S3 and SageMaker access'
)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")
```

**Use an existing execution role (optional)**

If you already have a SageMaker execution role, you can use it instead.

```
# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"
```

To find existing SageMaker roles in your account:

```
iam = boto3.client('iam', region_name=REGION)

# list_roles is paginated, so walk every page rather than only the first
paginator = iam.get_paginator('list_roles')
for page in paginator.paginate():
    for role in page['Roles']:
        if 'SageMaker' in role['RoleName']:
            print(f"{role['RoleName']}: {role['Arn']}")
```

**Important**  
The execution role must have permission to access Amazon S3 and SageMaker resources, and a trust relationship with `sagemaker.amazonaws.com`.

For more information about SageMaker execution roles, see [SageMaker Roles](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-roles.html).

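## Step 3: Configure model parameters
<a name="nova-sagemaker-inference-step3"></a>

Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For the list of supported instance types and the `CONTEXT_LENGTH` and `MAX_CONCURRENCY` values supported on each, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported).

**Required parameters**
+ `IMAGE`: The Docker container image URI for the Amazon Nova inference container. AWS provides this image.
+ `CONTEXT_LENGTH`: The model context length.
+ `MAX_CONCURRENCY`: The maximum number of sequences per iteration. Sets the limit on how many individual user requests (prompts) can be processed concurrently within a single batch on the GPU. Range: integer greater than 0.

**Optional generation parameters**
+ `DEFAULT_TEMPERATURE`: Controls randomness in generation. Range: 0.0 to 2.0 (0.0 = deterministic, higher = more random).
+ `DEFAULT_TOP_P`: Nucleus sampling threshold for token selection. Range: 1e-10 to 1.0.
+ `DEFAULT_TOP_K`: Limits token selection to the K most likely tokens. Range: integer -1 or greater (-1 = no limit).
+ `DEFAULT_MAX_NEW_TOKENS`: Maximum number of tokens to generate in the response (that is, the maximum output tokens). Range: integer 1 or greater.
+ `DEFAULT_LOGPROBS`: Number of log probabilities to return per token. Range: integer 1 to 20.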
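
The documented ranges can be checked locally before the container ever sees them. The following is a minimal sketch, assuming environment values are passed as strings as in this guide; `RANGES` and `validate_environment` are hypothetical names, and `CONTEXT_LENGTH` is left out because its valid values depend on the instance type.

```python
# Documented ranges (inclusive bounds; None means unbounded on that side).
RANGES = {
    "MAX_CONCURRENCY": (int, 1, None),
    "DEFAULT_TEMPERATURE": (float, 0.0, 2.0),
    "DEFAULT_TOP_P": (float, 1e-10, 1.0),
    "DEFAULT_TOP_K": (int, -1, None),
    "DEFAULT_MAX_NEW_TOKENS": (int, 1, None),
    "DEFAULT_LOGPROBS": (int, 1, 20),
}


def validate_environment(environment):
    """Return a list of problems found in a container environment dict."""
    problems = []
    for key, raw in environment.items():
        if key not in RANGES:
            continue  # keys without a documented range are not checked here
        cast, low, high = RANGES[key]
        try:
            value = cast(raw)
        except (TypeError, ValueError):
            problems.append(f"{key}={raw!r} is not a valid {cast.__name__}")
            continue
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            problems.append(f"{key}={raw!r} is outside the documented range")
    return problems
```

An empty list means every checked value falls inside its documented range.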

**Deployment configuration**

```
# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Model Parameters
CONTEXT_LENGTH = "16000"       # Maximum total context length
MAX_CONCURRENCY = "2"          # Maximum concurrent sequences

# Optional: default generation parameters (comment out any you do not want to set)
DEFAULT_TEMPERATURE = "0.0"   # Deterministic output
DEFAULT_TOP_P = "1.0"         # Consider all tokens
# DEFAULT_TOP_K = "50"        # Uncomment to limit to top 50 tokens
# DEFAULT_MAX_NEW_TOKENS = "2048"  # Uncomment to set max output tokens
# DEFAULT_LOGPROBS = "1"      # Uncomment to enable log probabilities

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
}

# Add optional parameters if defined
if 'DEFAULT_TEMPERATURE' in globals():
    environment['DEFAULT_TEMPERATURE'] = DEFAULT_TEMPERATURE
if 'DEFAULT_TOP_P' in globals():
    environment['DEFAULT_TOP_P'] = DEFAULT_TOP_P
if 'DEFAULT_TOP_K' in globals():
    environment['DEFAULT_TOP_K'] = DEFAULT_TOP_K
if 'DEFAULT_MAX_NEW_TOKENS' in globals():
    environment['DEFAULT_MAX_NEW_TOKENS'] = DEFAULT_MAX_NEW_TOKENS
if 'DEFAULT_LOGPROBS' in globals():
    environment['DEFAULT_LOGPROBS'] = DEFAULT_LOGPROBS

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")
```

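**Configure deployment-specific parameters**

Now configure the parameters specific to your Amazon Nova model deployment, including the model artifact location and the instance type.

**Set the deployment identifier**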

```
# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"
```

**Specify the model artifact location**

Provide the Amazon S3 URI where your trained Amazon Nova model artifacts are stored. This should be the output location of your model training or fine-tuning job.

```
# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
```

**Select the model variant and instance type**

```
# Configure model variant and instance type
TESTCASE = {
    "model": "lite2",              # Options: micro, lite, lite2
    "instance": "ml.p5.48xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")
```

**Naming convention**

The code automatically generates consistent names for the AWS resources:
+ Model name: `{JOB_NAME}-{model}-{instance-type}`
+ Endpoint configuration: `{MODEL_NAME}-Config`
+ Endpoint name: `{MODEL_NAME}-Endpoint`
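
Because the names are concatenated from several parts, it can help to check them before calling the API. The sketch below assumes the SageMaker naming constraints as commonly documented for models, endpoint configurations, and endpoints (alphanumeric characters and hyphens, starting and ending with an alphanumeric character, at most 63 characters); `build_resource_names` is a hypothetical helper that mirrors the convention above.

```python
import re

# Assumed constraint: letters, digits, and hyphens; no leading or trailing hyphen.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?")
MAX_NAME_LENGTH = 63


def build_resource_names(job_name, model, instance_type):
    """Derive model, endpoint config, and endpoint names and validate them."""
    model_name = f"{job_name}-{model}-{instance_type.replace('.', '-')}"
    names = {
        "model": model_name,
        "endpoint_config": model_name + "-Config",
        "endpoint": model_name + "-Endpoint",
    }
    for kind, name in names.items():
        if len(name) > MAX_NAME_LENGTH:
            raise ValueError(f"{kind} name is {len(name)} characters (max {MAX_NAME_LENGTH}): {name}")
        if not NAME_PATTERN.fullmatch(name):
            raise ValueError(f"{kind} name has unsupported characters: {name}")
    return names
```

With the values from this step, the longest generated name (`my-nova-deployment-lite2-ml-p5-48xlarge-Endpoint`) stays under the limit, but a long `JOB_NAME` can push the `-Endpoint` suffix past it.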

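## Step 4: Create the SageMaker model and endpoint configuration
<a name="nova-sagemaker-inference-step4"></a>

In this step, you create two required resources: a SageMaker model object that references your Amazon Nova model artifacts, and an endpoint configuration that defines how the model is deployed.

**SageMaker model**: A model object that packages the inference container image, the model artifact location, and the environment configuration. It is a reusable resource that you can deploy to multiple endpoints.

**Endpoint configuration**: Defines the infrastructure settings for the deployment, including the instance type, instance count, and model variants. This lets you manage deployment settings separately from the model itself.

**Create the SageMaker model**

The following code creates a SageMaker model that references your Amazon Nova model artifacts.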

```
try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")
```

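Key parameters:
+ `ModelName`: A unique identifier for the model
+ `Image`: The Docker container image URI for Amazon Nova inference
+ `ModelDataSource`: The Amazon S3 location of the model artifacts
+ `Environment`: The environment variables configured in Step 3
+ `ExecutionRoleArn`: The IAM role from Step 2
+ `EnableNetworkIsolation`: Set to True for enhanced security (prevents the container from making outbound network calls)

**Create the endpoint configuration**

Next, create an endpoint configuration that defines the deployment infrastructure.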

```
# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")
```

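Key parameters:
+ `VariantName`: An identifier for this model variant (use 'primary' for single-model deployments)
+ `ModelName`: References the model created above
+ `InitialInstanceCount`: The number of instances to deploy (start with 1 and scale later if needed)
+ `InstanceType`: The ML instance type selected in Step 3

**Verify resource creation**

You can verify that the resources were created successfully.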

```
# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")
```

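## Step 5: Deploy the endpoint
<a name="nova-sagemaker-inference-step5"></a>

The next step is to deploy your Amazon Nova model by creating a SageMaker real-time endpoint. The endpoint hosts the model and provides a secure HTTPS endpoint for inference requests.

Endpoint creation typically takes 15 to 30 minutes, because AWS provisions the infrastructure, downloads the model artifacts, and initializes the inference container.

**Create the endpoint**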

```
import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")
```

**Monitor endpoint creation**

The following code polls the endpoint status until deployment completes.

```
# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds
```
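
As an alternative to the hand-written polling loop, Boto3 ships an `endpoint_in_service` waiter for the SageMaker client. The sketch below wraps it in a hypothetical helper; the 30-second delay and 120 attempts (about one hour) are assumed values chosen to cover the typical 15 to 30 minute creation time.

```python
def wait_until_in_service(sagemaker_client, endpoint_name,
                          delay_seconds=30, max_attempts=120):
    """Block until the endpoint is InService, then return its description.

    The waiter raises botocore.exceptions.WaiterError if the endpoint
    enters the Failed state or the attempts run out.
    """
    waiter = sagemaker_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return sagemaker_client.describe_endpoint(EndpointName=endpoint_name)


# Usage, with the client and names defined in the earlier steps:
#   endpoint_info = wait_until_in_service(sagemaker, ENDPOINT_NAME)
#   print(endpoint_info["EndpointStatus"])
```

The waiter gives less progress output than the loop above, so the loop remains useful when you want to see each status change.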

**Verify that the endpoint is ready**

After the endpoint reaches the InService status, you can verify its configuration.

```
# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break
```

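**Troubleshooting endpoint creation failures**

Common failure reasons:
+ **Insufficient capacity**: The requested instance type is not available in your Region
  + Solution: Try a different instance type or request a quota increase
+ **IAM permissions**: The execution role lacks the required permissions
  + Solution: Verify that the role has access to the Amazon S3 model artifacts and the required SageMaker permissions
+ **Model artifacts not found**: The Amazon S3 URI is incorrect or inaccessible
  + Solution: Check the Amazon S3 URI, verify the bucket permissions, and confirm that the bucket is in the correct Region
+ **Resource limits**: Account limits for endpoints or instances were exceeded
  + Solution: Request a service quota increase through Service Quotas or AWS Support

**Note**  
If you need to delete a failed endpoint and start over:  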

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

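## Step 6: Invoke the endpoint
<a name="nova-sagemaker-inference-step6"></a>

After the endpoint reaches the InService status, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports synchronous endpoints (real time, in streaming or non-streaming mode) and asynchronous endpoints (Amazon S3-based, for batch processing).

**Set up the runtime client**

Create a SageMaker Runtime client with appropriate timeout settings.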

```
import json
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)
```

**Create a general-purpose inference function**

The following function handles both streaming and non-streaming requests.

```
def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Body=body
            )
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            response = runtime_client.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Accept='application/json',
                Body=body
            )
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")
```

**Example 1: Non-streaming chat completion**

Use the chat format for conversational interactions.

```
# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "reasoning_effort": "low",  # Options: "low", "high"
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)
```

**Sample response**:

```
{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}
```

**Example 2: Simple text completion**

Use the completion format for simple text generation.

```
# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)
```

**Sample response**:

```
{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
```
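
The two sample responses differ in where the generated text lives: chat completions use `choices[0].message.content`, and text completions use `choices[0].text`. The following sketch shows one way to read both shapes; the helper names are hypothetical and the field names are taken from the samples above.

```python
def extract_text(response):
    """Return the generated text from a chat or text completion response."""
    choices = (response or {}).get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    first = choices[0]
    if "message" in first:
        # Chat completion shape
        return first["message"].get("content") or ""
    # Text completion shape
    return first.get("text", "")


def total_tokens(response):
    """Return usage.total_tokens, or None if the response has no usage block."""
    return (response or {}).get("usage", {}).get("total_tokens")


# Usage with the function defined earlier in this step:
#   response = invoke_nova_endpoint(completion_request)
#   if response is not None:   # invoke_nova_endpoint returns None on errors
#       print(extract_text(response), total_tokens(response))
```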

**Example 3: Streaming chat completion**

```
# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "reasoning_effort": "high",  # For more detailed reasoning
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)
```

**Sample streaming output:**

```
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
```
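
To turn the streamed chunks into plain text, the events can be parsed and their `delta.content` values joined. This is a minimal sketch that assumes the wire format matches the sample output (newline-delimited `data: <json>` events ending with `data: [DONE]`); `collect_stream_text` is a hypothetical helper, and it buffers bytes because a payload part is not guaranteed to end on an event boundary.

```python
import json


def collect_stream_text(byte_chunks):
    """Join streamed delta content; returns (text, usage or None)."""
    buffer, pieces, usage = b"", [], None

    def handle(line):
        """Process one complete line; return True when the stream is done."""
        nonlocal usage
        text = line.decode("utf-8").strip()
        if not text.startswith("data:"):
            return False  # blank separator lines and anything unexpected
        payload = text[len("data:"):].strip()
        if payload == "[DONE]":
            return True
        event = json.loads(payload)
        usage = event.get("usage") or usage
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
        return False

    for chunk in byte_chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if handle(line):
                return "".join(pieces), usage
    if buffer.strip():
        handle(buffer)  # final line without a trailing newline
    return "".join(pieces), usage


# Usage with a response from invoke_endpoint_with_response_stream:
#   parts = (e["PayloadPart"]["Bytes"] for e in response["Body"] if "PayloadPart" in e)
#   text, usage = collect_stream_text(parts)
```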

**Example 4: Multimodal chat completion**

Use the multimodal format for image and text inputs.

```
# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)
```

**Sample response**:

```
{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}
```
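
The multimodal request in Example 4 elides the image data (`data:image/jpeg;base64,...`). A sketch of how such a data URL might be produced from raw image bytes follows; the helper names and the default MIME type are assumptions, and the message layout copies the request shown in Example 4.

```python
import base64


def image_content_part(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes as an image_url content part with a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_multimodal_message(question, image_bytes, mime_type="image/jpeg"):
    """Build a user message that pairs a text question with one image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            image_content_part(image_bytes, mime_type),
        ],
    }


# Usage (the file path is a placeholder):
#   with open("photo.jpg", "rb") as f:
#       message = build_multimodal_message("What's in this image?", f.read())
#   response = invoke_nova_endpoint({"messages": [message], "max_tokens": 150})
```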

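## Step 7: Clean up resources (optional)
<a name="nova-sagemaker-inference-step7"></a>

To avoid incurring unnecessary charges, delete the AWS resources that you created during this tutorial. SageMaker endpoints incur charges while they are running, even when you are not actively making inference requests.

**Important**  
Deleting these resources is permanent and cannot be undone. Before you proceed, make sure that you no longer need them.

**Delete the endpoint**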

```
import boto3

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")
```

**Note**  
Endpoint deletion is asynchronous. You can monitor the deletion status.  

```
import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break
```

**Delete the endpoint configuration**

After the endpoint is deleted, remove the endpoint configuration.

```
try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")
```

**Delete the model**

Remove the SageMaker model object.

```
try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
```

# API reference
<a name="nova-sagemaker-inference-api-reference"></a>

Amazon Nova models on SageMaker use the standard SageMaker Runtime API for inference. For the complete API documentation, see [Test a deployed model](https://docs.aws.amazon.com//sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

## Endpoint invocation
<a name="nova-sagemaker-inference-api-invocation"></a>

Amazon Nova models on SageMaker support the following two invocation methods:
+ **Synchronous invocation**: Use the [InvokeEndpoint](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API for real-time, non-streaming inference requests.
+ **Streaming invocation**: Use the [InvokeEndpointWithResponseStream](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html) API for real-time, streaming inference requests.

## Request format
<a name="nova-sagemaker-inference-api-request"></a>

Amazon Nova models support the following request formats.

**Chat completion format**

Use the following format for conversational interactions.

```
{
  "messages": [
    {"role": "user", "content": "string"}
  ],
  "max_tokens": integer,
  "max_completion_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": boolean,
  "top_logprobs": integer,
  "reasoning_effort": "low" | "high",
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Text completion format**

Use this format for simple text generation.

```
{
  "prompt": "string",
  "max_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": integer,
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Multimodal chat completion format**

Use this format for image and text input.

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": integer,
  "temperature": float,
  "top_p": float,
  "stream": boolean
}
```
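
As an illustration of the multimodal format, the following sketch builds a user message that embeds an image as a base64 data URL. The helper name `build_image_message` and the default MIME type are assumptions made for this example. The three bytes used here are only the JPEG magic number, standing in for a real image file.

```python
import base64


def build_image_message(text, image_bytes, mime_type="image/jpeg"):
    """Return a user message that pairs a text prompt with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; in practice, read them from a file opened in binary mode.
message = build_image_message("What's in this image?", b"\xff\xd8\xff")
request = {"messages": [message], "max_tokens": 256}
```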

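**Request parameters**
+ `messages` (array): For the chat completion format. An array of message objects with `role` and `content` fields. The content can be a text-only string or an array of multimodal inputs.
+ `prompt` (string): For the text completion format. The input text to generate from.
+ `max_tokens` (integer): Maximum number of tokens to generate in the response. Range: 1 or greater.
+ `max_completion_tokens` (integer): An alternative to `max_tokens` for chat completions. The maximum number of completion tokens to generate.
+ `temperature` (float): Controls randomness in generation. Range: 0.0 to 2.0 (0.0 = deterministic, 2.0 = maximum randomness).
+ `top_p` (float): Nucleus sampling threshold. Range: 1e-10 to 1.0.
+ `top_k` (integer): Limits token selection to the K most likely tokens. Range: -1 or greater (-1 = no limit).
+ `stream` (boolean): Whether to stream the response. Set to `true` for streaming or `false` for non-streaming.
+ `logprobs` (boolean/integer): For chat completions, use a boolean. For text completions, use an integer for the number of log probabilities to return. Range: 1 to 20.
+ `top_logprobs` (integer): Number of most likely tokens for which to return log probabilities (chat completions only).
+ `reasoning_effort` (string): Reasoning effort level. Options: 'low', 'high' (chat completions for Nova 2 Lite custom models only).
+ `allowed_token_ids` (array): List of token IDs that can be generated. Restricts output to the specified tokens.
+ `truncate_prompt_tokens` (integer): Truncates the prompt to this many tokens if it exceeds the limit.
+ `stream_options` (object): Options for streaming responses. Contains the `include_usage` boolean, which includes token usage in streaming responses.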
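
The documented ranges can be checked on the client before a request is sent. The following is a rough sketch rather than the service's own validation: `validate_request` is a hypothetical helper, the rule that a request carries exactly one of `messages` or `prompt` is an assumption drawn from the two formats, and the 1 to 20 range is applied to the integer form of `logprobs`.

```python
def validate_request(params):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    # Assumption: a request uses either the chat format or the text format.
    if ("messages" in params) == ("prompt" in params):
        problems.append("provide exactly one of 'messages' or 'prompt'")
    is_chat = "messages" in params

    def check(name, is_valid, rule):
        if name in params and not is_valid(params[name]):
            problems.append(f"{name} must be {rule}")

    check("max_tokens", lambda v: v >= 1, ">= 1")
    check("temperature", lambda v: 0.0 <= v <= 2.0, "between 0.0 and 2.0")
    check("top_p", lambda v: 1e-10 <= v <= 1.0, "between 1e-10 and 1.0")
    check("top_k", lambda v: v >= -1, ">= -1")
    check("reasoning_effort", lambda v: v in ("low", "high"), "'low' or 'high'")

    if "logprobs" in params:
        value = params["logprobs"]
        if is_chat and not isinstance(value, bool):
            problems.append("logprobs must be a boolean for chat completions")
        elif not is_chat and (isinstance(value, bool) or not 1 <= value <= 20):
            problems.append("logprobs must be an integer from 1 to 20 for text completions")
    return problems


chat_request = {
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.9,
    "logprobs": True,
}
print(validate_request(chat_request))
```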

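## Response format
<a name="nova-sagemaker-inference-api-response"></a>

The response format depends on the invocation method and the request type.

**Chat completion response (non-streaming)**

For synchronous chat completion requests: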

```
{
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": [9906, 0, 358, 2157, 1049, 11, 1309, 345, 369, 6464, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "prompt_token_ids": [9906, 0, 358]
}
```

**Text completion response (non-streaming)**

For synchronous text completion requests:

```
{
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris, the capital and most populous city of France.",
      "logprobs": {
        "tokens": ["Paris", ",", " the", " capital"],
        "token_logprobs": [-0.31725305, -0.07918124, -0.12345678, -0.23456789],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403,
            "Rome": -2.1234567
          },
          {
            ",": -0.07918124,
            " is": -1.2345678
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_token_ids": [464, 6864, 315, 4881, 374],
      "token_ids": [3915, 11, 279, 6864, 323, 1455, 95551, 3363, 315, 4881, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}
```

**Chat completion streaming response**

For streaming chat completion requests, the response is sent as server-sent events (SSE).

```
data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Hello",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              }
            ]
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null,
  "prompt_token_ids": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "! I'm"
      },
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}

data: [DONE]
```
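
A client can reassemble the generated text by reading each `data:` event and concatenating the `delta.content` values until it sees `[DONE]`. The following sketch assumes that each event arrives as a single line of JSON (the example above is pretty-printed for readability) and that the caller has already split the byte stream into complete lines. `collect_stream_text` is a hypothetical helper name.

```python
import json


def collect_stream_text(lines):
    """Concatenate generated text from an iterable of SSE lines.

    Returns a (text, usage) tuple; usage is None if no chunk reported it.
    """
    pieces = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            pieces.append(delta.get("content") or "")
        usage = chunk.get("usage") or usage
    return "".join(pieces), usage


events = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Hello"}}], "usage": null}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}], "usage": {"total_tokens": 21}}',
    "data: [DONE]",
]
text, usage = collect_stream_text(events)
```

Because a streamed payload part can end in the middle of a line, buffer the incoming bytes and split on newlines before passing lines to a parser like this one.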

**Text completion streaming response**

For streaming text completion requests:

```
data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris",
      "logprobs": {
        "tokens": ["Paris"],
        "token_logprobs": [-0.31725305],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": ", the capital",
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "",
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16
  }
}

data: [DONE]
```

**Response field descriptions**
+ `id`: Unique identifier for the completion
+ `object`: Type of object returned ('chat.completion', 'text_completion', 'chat.completion.chunk')
+ `created`: Unix timestamp of when the completion was created
+ `model`: Model used for the completion
+ `choices`: Array of completion choices
+ `usage`: Token usage information, including prompt, completion, and total tokens
+ `logprobs`: Log probability information for tokens (when requested)
+ `finish_reason`: Reason the model stopped generating ('stop', 'length', 'content_filter')
+ `delta`: Incremental content in streaming responses
+ `reasoning`: Reasoning content when `reasoning_effort` is used
+ `token_ids`: Array of token IDs for the generated text

For complete API documentation, see the [InvokeEndpoint API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) and the [InvokeEndpointWithResponseStream API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html).
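
To show how these fields fit together, the following sketch pulls the generated text, finish reason, and token count out of a parsed non-streaming response. The helper names are hypothetical. Treating a `finish_reason` of 'length' as a sign that the output hit the token limit is an assumption based on common OpenAI-style semantics, not something this page states.

```python
def extract_text(response):
    """Return the generated text from a chat or text completion response."""
    choice = response["choices"][0]
    if response.get("object") == "chat.completion":
        return choice["message"]["content"]
    return choice["text"]


def summarize(response):
    """Collect the fields most callers need into one small dictionary."""
    finish_reason = response["choices"][0].get("finish_reason")
    usage = response.get("usage") or {}
    return {
        "text": extract_text(response),
        "finish_reason": finish_reason,
        # Assumption: 'length' means generation stopped at the token limit.
        "truncated": finish_reason == "length",
        "total_tokens": usage.get("total_tokens"),
    }


chat_response = {
    "object": "chat.completion",
    "choices": [{"index": 0, "message": {"role": "assistant", "content": "Hello!"}, "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
summary = summarize(chat_response)
```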

# Evaluating models hosted on SageMaker Inference
<a name="nova-eval-on-sagemaker-inference"></a>

This guide explains how to evaluate customized Amazon Nova models deployed to SageMaker Inference endpoints by using [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai), an open-source evaluation framework.

**Note**  
For a hands-on walkthrough, see the [SageMaker Inspect AI quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb).

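## Overview
<a name="nova-eval-sagemaker-overview"></a>

You can evaluate customized Amazon Nova models deployed to SageMaker endpoints by using standardized benchmarks from the AI research community. With this approach, you can do the following:
+ Evaluate customized Amazon Nova models (fine-tuned, distilled, or otherwise adapted) at scale
+ Run evaluations with parallel inference across multiple endpoint instances
+ Compare model performance by using benchmarks such as MMLU, TruthfulQA, and HumanEval
+ Integrate with your existing SageMaker infrastructure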

## Supported models
<a name="nova-eval-sagemaker-supported-models"></a>

The SageMaker inference provider works with the following:
+ Amazon Nova models (Nova Micro, Nova Lite, Nova 2 Lite)
+ Models deployed through vLLM or OpenAI-compatible inference servers
+ Any endpoint that supports the OpenAI chat completions API format

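## Prerequisites
<a name="nova-eval-sagemaker-prerequisites"></a>

Before you begin, make sure that you have the following:
+ An AWS account with permissions to create and invoke SageMaker endpoints
+ AWS credentials configured through the AWS CLI, environment variables, or an IAM role
+ Python 3.9 or later

**Required IAM permissions**

Your IAM user or role needs the following permissions.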

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:InvokeEndpoint",
        "sagemaker:DescribeEndpoint"
      ],
      "Resource": "arn:aws:sagemaker:*:*:endpoint/*"
    }
  ]
}
```

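## Step 1: Deploy a SageMaker endpoint
<a name="nova-eval-sagemaker-step1"></a>

Before you run an evaluation, you need a SageMaker Inference endpoint that is running your model.

For instructions on creating a SageMaker Inference endpoint with an Amazon Nova model, see [Getting started](nova-sagemaker-inference-getting-started.md).

After the endpoint reaches the `InService` status, note the endpoint name to use in the evaluation commands.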
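
One way to wait for the `InService` status from a script is to poll `describe_endpoint`. The following is a minimal sketch under stated assumptions: `wait_until_in_service` is a hypothetical helper, and the polling interval and timeout are illustrative defaults rather than recommended values.

```python
import time


def wait_until_in_service(sagemaker_client, endpoint_name,
                          poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll the endpoint until it is InService; raise if it fails or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sagemaker_client.describe_endpoint(
            EndpointName=endpoint_name
        )["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} is still {status}")
        sleep(poll_seconds)


# Example usage (requires AWS credentials and an existing endpoint):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-west-2")
#   wait_until_in_service(client, "my-nova-endpoint")
```

boto3 also ships a built-in `endpoint_in_service` waiter for the SageMaker client, which you may prefer when you do not need custom handling of intermediate statuses.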

## Step 2: Install evaluation dependencies
<a name="nova-eval-sagemaker-step2"></a>

Create a Python virtual environment and install the required packages.

```
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install uv for faster package installation
pip install uv

# Install Inspect AI and evaluation benchmarks
uv pip install inspect-ai inspect-evals

# Install AWS dependencies
uv pip install aioboto3 boto3 botocore openai
```

## Step 3: Configure AWS credentials
<a name="nova-eval-sagemaker-step3"></a>

Choose one of the following authentication methods.

**Option 1: AWS CLI (recommended)**

```
aws configure
```

When prompted, enter your AWS access key ID, secret access key, and default Region.

**Option 2: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
```

**Option 3: IAM role**

When you run on Amazon EC2 or in a SageMaker notebook, the instance's IAM role is used automatically.

**Verify credentials**

```
import boto3

sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Account: {identity['Account']}")
print(f"User/Role: {identity['Arn']}")
```

## Step 4: Install the SageMaker provider
<a name="nova-eval-sagemaker-step4"></a>

The SageMaker provider lets Inspect AI communicate with SageMaker endpoints. The [quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb) streamlines the provider installation process.

## Step 5: Download evaluation benchmarks
<a name="nova-eval-sagemaker-step5"></a>

Clone the Inspect Evals repository to access standard benchmarks.

```
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
```

This repository includes benchmarks such as the following:
+ MMLU and MMLU-Pro (knowledge and reasoning)
+ TruthfulQA (truthfulness)
+ HumanEval (code generation)
+ GSM8K (mathematical reasoning)

## Step 6: Run an evaluation
<a name="nova-eval-sagemaker-step6"></a>

Run an evaluation against your SageMaker endpoint.

```
cd inspect_evals/src/inspect_evals/

inspect eval mmlu_pro/mmlu_pro.py \
  --model sagemaker/my-nova-endpoint \
  -M region_name=us-west-2 \
  --max-connections 256 \
  --max-retries 100 \
  --display plain
```

**Key parameters**


| Parameter | Default | Description | 
| --- | --- | --- | 
| --max-connections | 10 | Number of parallel requests to the endpoint. Scale with the number of instances (for example, 10 instances × 25 = 250). | 
| --max-retries | 3 | Number of retries for failed requests. Use 50 to 100 for large evaluations. | 
| -M region_name | us-east-1 | AWS Region where the endpoint is deployed. | 
| -M read_timeout | 600 | Request timeout, in seconds. | 
| -M connect_timeout | 60 | Connection timeout, in seconds. | 

**Scaling recommendations**

For multi-instance endpoints:

```
# 10-instance endpoint example
--max-connections 250   # ~25 connections per instance
--max-retries 100       # Handle transient errors
```

Setting `--max-connections` too high can overwhelm the endpoint and cause throttling. Setting it too low leaves capacity underused.
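
The scaling rule of thumb above (about 25 connections per instance) is simple arithmetic, sketched below together with a command builder that uses only the flags shown in Step 6. The function names are hypothetical, and 25 is a starting point taken from the example, not a guaranteed limit for every instance type.

```python
def suggested_max_connections(instance_count, per_instance=25):
    """Scale parallel requests with the number of endpoint instances."""
    if instance_count < 1 or per_instance < 1:
        raise ValueError("instance_count and per_instance must be at least 1")
    return instance_count * per_instance


def inspect_eval_command(task, endpoint_name, region, instance_count):
    """Build the argument list for an `inspect eval` run against an endpoint."""
    return [
        "inspect", "eval", task,
        "--model", f"sagemaker/{endpoint_name}",
        "-M", f"region_name={region}",
        "--max-connections", str(suggested_max_connections(instance_count)),
        "--max-retries", "100",
        "--display", "plain",
    ]


command = inspect_eval_command("mmlu_pro/mmlu_pro.py", "my-nova-endpoint", "us-west-2", 10)
print(" ".join(command))
```

If you see throttling at the computed value, lower `per_instance` and rerun. The resulting list can be passed to `subprocess.run` as is.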

## Step 7: View results
<a name="nova-eval-sagemaker-step7"></a>

Start the Inspect AI viewer to analyze the evaluation results.

```
inspect view
```

The viewer shows the following:
+ Overall scores and metrics
+ Per-sample results with model responses
+ Error analysis and failure patterns

## Managing endpoints
<a name="nova-eval-sagemaker-managing-endpoints"></a>

**Update an endpoint**

To update an existing endpoint with a new model or configuration:

```
import boto3

sagemaker = boto3.client('sagemaker', region_name=REGION)

# Create new model and endpoint configuration
# Then update the endpoint
sagemaker.update_endpoint(
    EndpointName=EXISTING_ENDPOINT_NAME,
    EndpointConfigName=NEW_ENDPOINT_CONFIG_NAME
)
```

**Delete an endpoint**

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

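## Onboarding custom benchmarks
<a name="nova-eval-sagemaker-custom-benchmarks"></a>

You can add a new benchmark to Inspect AI by using the following workflow.

1. Study the benchmark's dataset format and evaluation metrics

1. Review similar implementations in `inspect_evals/`

1. Create a task file that converts dataset records into Inspect AI samples

1. Implement appropriate solvers and scorers

1. Validate with a small test run

Example task structure: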

```
from inspect_ai import Task, task
from inspect_ai.dataset import hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def my_benchmark():
    return Task(
        dataset=hf_dataset("dataset_name", split="test"),
        solver=multiple_choice(),
        scorer=choice()
    )
```

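## Troubleshooting
<a name="nova-eval-sagemaker-troubleshooting"></a>

**Common issues**

**Endpoint throttling or timeouts**
+ Reduce `--max-connections`
+ Increase `--max-retries`
+ Check the endpoint's CloudWatch metrics for capacity issues

**Authentication errors**
+ Verify that your AWS credentials are configured correctly
+ Verify that your IAM permissions include `sagemaker:InvokeEndpoint`

**Model errors**
+ Verify that the endpoint is in the `InService` status
+ Verify that the model supports the OpenAI chat completions API format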

## Related resources
<a name="nova-eval-sagemaker-related-resources"></a>
+ [Inspect AI documentation](https://inspect.ai-safety-institute.org.uk/)
+ [Inspect Evals repository](https://github.com/UKGovernmentBEIS/inspect_evals)
+ [SageMaker Developer Guide](https://docs.aws.amazon.com//sagemaker/latest/dg/whatis.html)
+ [Deploy models for inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html)
+ [Configuring the AWS CLI](https://docs.aws.amazon.com//cli/latest/userguide/cli-chap-configure.html)

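# Abuse detection for Amazon Nova Forge models deployed on Amazon SageMaker Inference
<a name="nova-sagemaker-inference-abuse-detection"></a>

AWS is committed to the responsible use of AI. To help prevent potential misuse, when you deploy Amazon Nova Forge models on Amazon SageMaker Inference, SageMaker Inference implements automated abuse detection mechanisms to identify potential violations of the AWS [Acceptable Use Policy](https://aws.amazon.com/aup/) (AUP) and Service Terms, including the [Responsible AI](https://aws.amazon.com/ai/responsible-ai/policy/) policy.

The abuse detection mechanisms are fully automated, so no human reviews or accesses user inputs or model outputs.

Automated abuse detection includes the following:
+ **Content classification** - We use classifiers to detect harmful content (for example, content that incites violence) in user inputs and model outputs. A classifier is an algorithm that processes model inputs and outputs and assigns a type of harm and a confidence level. We may run these classifiers on Amazon Nova Forge model usage. The classification process is automated and does not involve human review of user inputs or model outputs.
+ **Pattern identification** - We use classifier metrics to identify potential violations and recurring behavior. We may compile and share anonymized classifier metrics. Amazon SageMaker Inference does not store user inputs or model outputs.
+ **Detection and blocking of child sexual abuse material (CSAM)** - You are responsible for the content that you (and your end users) upload to Amazon SageMaker Inference, and you must make sure that this content contains no illegal images. When you deploy Amazon Nova Forge models on Amazon SageMaker Inference, SageMaker may use automated abuse detection mechanisms (such as hash matching technology or classifiers) to detect apparent CSAM in order to help stop its distribution. If Amazon SageMaker Inference detects apparent CSAM in your image inputs, it blocks the request and you receive an automated error message. Amazon SageMaker Inference may also file a report with the National Center for Missing and Exploited Children (NCMEC) or a relevant authority. We take CSAM seriously and will continue to update our detection, blocking, and reporting mechanisms. Applicable laws might require you to take additional actions, and you are responsible for those actions.

When the automated abuse detection mechanisms identify potential violations, we may request information about your use of Amazon SageMaker Inference and your compliance with our Service Terms. If you do not or cannot comply with these terms or policies, AWS may suspend your access to Amazon SageMaker Inference. You may also be billed for failed inference jobs when automated testing detects that model responses are inconsistent with the terms and policies.

If you have additional questions, contact AWS Support. For more information, see the [Amazon SageMaker FAQ](https://aws.amazon.com/sagemaker/ai/faqs/).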