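# SageMaker Inference
<a name="nova-model-sagemaker-inference"></a>

Custom Amazon Nova models are now available on SageMaker Inference. With Amazon Nova on SageMaker, you can run predictions (inference) against your trained custom Nova models. SageMaker provides a range of ML infrastructure and model deployment options to meet your ML inference needs. With SageMaker Inference, you can scale model deployments elastically, manage models in production more efficiently, and reduce operational burden.

SageMaker supports several inference options, such as real-time endpoints for low-latency inference and asynchronous endpoints for batches of requests. Choosing the option that fits your use case keeps model deployment and inference efficient. For more information about SageMaker Inference, see [Deploy models for inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html).

**Important**  
SageMaker Inference supports only full-rank custom models and LoRA-merged models. For unmerged LoRA models and base models, use Amazon Bedrock.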

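## Features
<a name="nova-sagemaker-inference-features"></a>

The following features are supported for Amazon Nova models on SageMaker Inference:

**Model capabilities**
+ Text generation

**Deployment and scaling**
+ Real-time endpoints with custom instance selection
+ Auto scaling: automatically adjusts capacity based on traffic to optimize cost and GPU utilization. For more information, see [Automatically scale Amazon SageMaker models](https://docs.aws.amazon.com//sagemaker/latest/dg/endpoint-auto-scaling.html).
+ Streaming API support for real-time token generation

**Monitoring and optimization**
+ Amazon CloudWatch integration for monitoring and alerting
+ Availability Zone-aware latency optimization through VPC configuration

**Development tools**
+ AWS CLI support: for more information, see the [SageMaker AWS CLI command reference](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/).
+ Notebook integration through SDK support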

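## Supported models and instances
<a name="nova-sagemaker-inference-supported"></a>

When you create a SageMaker Inference endpoint, you can set two environment variables to configure the deployment: `CONTEXT_LENGTH` and `MAX_CONCURRENCY`.
+ `CONTEXT_LENGTH`: maximum total token length (input + output) of a single request
+ `MAX_CONCURRENCY`: maximum number of concurrent requests the endpoint can serve

The following table lists the supported Amazon Nova models, instance types, and corresponding configurations. Each `MAX_CONCURRENCY` value is the maximum concurrency supported at the given `CONTEXT_LENGTH` setting: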


[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/zh_cn/nova/latest/nova2-userguide/nova-model-sagemaker-inference.html)

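**Note**  
The `MAX_CONCURRENCY` values in the table are upper limits for the corresponding `CONTEXT_LENGTH` setting. You can use a shorter context length at the same concurrency, but exceeding these limits causes SageMaker endpoint creation to fail.  
For example, for Amazon Nova Micro on ml.g5.12xlarge:  
`CONTEXT_LENGTH=2000`, `MAX_CONCURRENCY=32` → valid  
`CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=32` → invalid (the concurrency limit at context length 8000 is 16)  
`CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=4` → valid  
`CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=16` → valid  
`CONTEXT_LENGTH=10000` → invalid (the maximum context length on this instance is 8000)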
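
The rule in this note can be expressed as a small check to run before creating an endpoint. The following is a minimal sketch, not an official tool: `LIMITS` and `is_valid_config` are hypothetical names, the table holds only the two data points from the example above, and treating 32 as the cap at context length 2000 is an assumption. Fill in real values from the supported-configurations table.

```python
# Hypothetical limits for ONE model/instance pair (Nova Micro on ml.g5.12xlarge),
# taken from the example above. Keys are CONTEXT_LENGTH tiers; values are the
# assumed MAX_CONCURRENCY cap at that tier.
LIMITS = {2000: 32, 8000: 16}


def is_valid_config(context_length, max_concurrency, limits=LIMITS):
    """Return True if the pair fits within the smallest tier that covers it."""
    if context_length < 1 or max_concurrency < 1:
        return False
    covering = [tier for tier in sorted(limits) if context_length <= tier]
    if not covering:
        return False  # longer than the largest supported context length
    return max_concurrency <= limits[covering[0]]


for ctx, conc in [(2000, 32), (8000, 32), (8000, 4), (8000, 16), (10000, 1)]:
    print(ctx, conc, "valid" if is_valid_config(ctx, conc) else "invalid")
```

A context length between two tiers is checked against the next larger tier, which is the conservative choice when the exact limit is not listed.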

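## Supported AWS Regions
<a name="nova-sagemaker-inference-regions"></a>

The following table lists the AWS Regions where Amazon Nova models are available on SageMaker Inference:

| Region name | Region code | Availability | 
| --- | --- | --- | 
| US East (N. Virginia) | us-east-1 | Available | 
| US West (Oregon) | us-west-2 | Available | 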

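## Supported Regions and container images
<a name="nova-sagemaker-inference-container-images"></a>

The following table lists, by Region, the container image URIs for Amazon Nova models on SageMaker Inference. Each Region provides two image tags: a versioned tag (`v1.0.0`) and a latest tag (`SM-Inference-latest`). For production deployments, the versioned tag is recommended.

| Region | Container image URI | 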
| --- | --- | 
| us-east-1 | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-inference-repo:SM-Inference-latest | 
| us-west-2 | 176779409107.dkr.ecr.us-west-2.amazonaws.com/nova-inference-repo:SM-Inference-latest | 
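
To pin the versioned tag instead of the latest tag shown in the table, the URI can be assembled from its parts. The sketch below is illustrative: `nova_image_uri` is a hypothetical helper, and it assumes the `v1.0.0` tag lives in the same repository as `SM-Inference-latest`, as the paragraph above suggests.

```python
# Registry account IDs per Region, copied from the table above.
ECR_ACCOUNTS = {"us-east-1": "708977205387", "us-west-2": "176779409107"}


def nova_image_uri(region, tag="v1.0.0"):
    """Build the container image URI for a Region and image tag."""
    if region not in ECR_ACCOUNTS:
        raise ValueError(f"Unsupported Region: {region}")
    account = ECR_ACCOUNTS[region]
    return f"{account}.dkr.ecr.{region}.amazonaws.com/nova-inference-repo:{tag}"


print(nova_image_uri("us-west-2"))                          # versioned tag
print(nova_image_uri("us-east-1", "SM-Inference-latest"))   # latest tag
```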

## Best practices
<a name="nova-sagemaker-inference-best-practices"></a>

For best practices on deploying and managing models on SageMaker, see [SageMaker best practices](https://docs.aws.amazon.com//sagemaker/latest/dg/best-practices.html).

## Support
<a name="nova-sagemaker-inference-support"></a>

If you run into problems or need help with Amazon Nova models on SageMaker Inference, contact AWS Support through the console or your AWS account manager.

**Topics**
+ [Features](#nova-sagemaker-inference-features)
+ [Supported models and instances](#nova-sagemaker-inference-supported)
+ [Supported AWS Regions](#nova-sagemaker-inference-regions)
+ [Supported Regions and container images](#nova-sagemaker-inference-container-images)
+ [Best practices](#nova-sagemaker-inference-best-practices)
+ [Support](#nova-sagemaker-inference-support)
+ [Getting started](nova-sagemaker-inference-getting-started.md)
+ [API reference](nova-sagemaker-inference-api-reference.md)
+ [Evaluating models hosted on SageMaker Inference](nova-eval-on-sagemaker-inference.md)
+ [Abuse detection for Amazon Nova Forge models deployed in Amazon SageMaker Inference](nova-sagemaker-inference-abuse-detection.md)

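# Getting started
<a name="nova-sagemaker-inference-getting-started"></a>

This guide shows how to deploy a custom Amazon Nova model on a SageMaker real-time endpoint, configure inference parameters, and invoke the model for testing.

## Prerequisites
<a name="nova-sagemaker-inference-prerequisites"></a>

Deploying Amazon Nova models on SageMaker Inference requires the following:
+ An AWS account: if you don't have one, see [Create an AWS account](https://docs.aws.amazon.com//sagemaker/latest/dg/gs-set-up.html#sign-up-for-aws).
+ Required IAM permissions: make sure your IAM user or role has the following managed policies attached:
  + `AmazonSageMakerFullAccess`
  + `AmazonS3FullAccess`
+ Required SDK/CLI versions: the following SDK versions have been tested and validated with Amazon Nova models on SageMaker Inference:
  + For the resource-based API approach: SageMaker Python SDK v3.0.0+ (`sagemaker>=3.0.0`)
  + For direct API calls: Boto3 version 1.35.0+ (`boto3>=1.35.0`). The examples in this guide use this approach.
+ Service quota increase: request an Amazon SageMaker service quota increase for the ML instance type you plan to use for your SageMaker Inference endpoint (for example, `ml.p5.48xlarge for endpoint usage`). For the list of supported instance types, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported). To request an increase, see [Requesting a quota increase](https://docs.aws.amazon.com//servicequotas/latest/userguide/request-quota-increase.html). For more information about SageMaker instance quotas, see [SageMaker endpoints and quotas](https://docs.aws.amazon.com//general/latest/gr/sagemaker.html).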
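
The SDK version requirements above can be checked from Python before starting. This is a rough sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers that compare the first three numeric components and ignore pre-release suffixes, which is an intentional simplification.

```python
from importlib import metadata

# Minimum versions listed in the prerequisites above.
MINIMUMS = {"boto3": (1, 35, 0), "sagemaker": (3, 0, 0)}


def parse_version(text):
    """Turn '1.35.12' into (1, 35, 12); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare an installed version string against a (major, minor, patch) tuple."""
    return parse_version(installed) >= minimum


for package, minimum in MINIMUMS.items():
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed")
        continue
    status = "OK" if meets_minimum(installed, minimum) else "too old"
    print(f"{package} {installed}: {status}")
```

Only one of the two packages is needed, depending on which API approach you use, so a "not installed" line for the other is expected.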

## Step 1: Configure AWS credentials
<a name="nova-sagemaker-inference-step1"></a>

Configure your AWS credentials using one of the following methods:

**Option 1: AWS CLI (recommended)**

```
aws configure
```

When prompted, enter your AWS access key ID, secret access key, and default Region name.

**Option 2: AWS credentials file**

Create or edit `~/.aws/credentials`:

```
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

**Option 3: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
```

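**Note**  
For more information about AWS credentials, see [Configuration and credential file settings](https://docs.aws.amazon.com//cli/latest/userguide/cli-configure-files.html).

**Initialize AWS clients**

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials: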

```
import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts')

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")
```

If authentication succeeds, the output confirms your AWS account ID.

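## Step 2: Create a SageMaker execution role
<a name="nova-sagemaker-inference-step2"></a>

A SageMaker execution role is an IAM role that grants SageMaker permission to access AWS resources on your behalf, such as the Amazon S3 bucket that stores your model artifacts and CloudWatch for logging.

**Create the execution role**

**Note**  
Creating an IAM role requires the `iam:CreateRole` and `iam:AttachRolePolicy` permissions. Make sure your IAM user or role has these permissions before you continue.

The following code creates an IAM role with the permissions needed to deploy Amazon Nova custom models: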

```
import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role
role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='SageMaker execution role with S3 and SageMaker access'
)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")
```

**Use an existing execution role (optional)**

If you already have a SageMaker execution role, you can reuse it:

```
# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"
```

To find existing SageMaker roles in your account:

```
iam = boto3.client('iam', region_name=REGION)
# list_roles is paginated, so walk every page rather than only the first one
paginator = iam.get_paginator('list_roles')
for page in paginator.paginate():
    for role in page['Roles']:
        if 'SageMaker' in role['RoleName']:
            print(f"{role['RoleName']}: {role['Arn']}")
```

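**Important**  
The execution role must have a trust relationship with `sagemaker.amazonaws.com` and permissions to access Amazon S3 and SageMaker resources.

For more information about SageMaker execution roles, see [SageMaker roles](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-roles.html).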
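
When reusing an existing role, it can help to confirm the trust relationship before deploying. The following sketch inspects a trust policy document that is already loaded as a Python dict; `trusts_sagemaker` is a hypothetical helper, and it deliberately ignores `Condition` blocks, so treat a `True` result as a quick sanity check rather than a full policy evaluation.

```python
def trusts_sagemaker(trust_policy):
    """Return True if an Allow statement lets sagemaker.amazonaws.com assume the role."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        principal = statement.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        if "sts:AssumeRole" in actions and "sagemaker.amazonaws.com" in services:
            return True
    return False


example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(trusts_sagemaker(example_policy))
```

With Boto3, the document for an existing role should be available as `iam.get_role(RoleName=...)['Role']['AssumeRolePolicyDocument']`; verify the returned shape in your environment before relying on it.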

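## Step 3: Configure model parameters
<a name="nova-sagemaker-inference-step3"></a>

Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For the supported instance types and the `CONTEXT_LENGTH` and `MAX_CONCURRENCY` values supported on each, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported).

**Required parameters**
+ `IMAGE`: the Docker container image URI of the Amazon Nova inference container. AWS provides this URI.
+ `CONTEXT_LENGTH`: the model context length.
+ `MAX_CONCURRENCY`: the maximum number of sequences per iteration. It caps how many distinct user requests (prompts) can be processed concurrently in a single batch on the GPU. Valid values: integers greater than 0.

**Optional generation parameters**
+ `DEFAULT_TEMPERATURE`: controls the randomness of generated output. Valid values: 0.0 to 2.0 (0.0 = deterministic; higher values are more random).
+ `DEFAULT_TOP_P`: nucleus sampling threshold for token selection. Valid values: 1e-10 to 1.0.
+ `DEFAULT_TOP_K`: limits token selection to the K most probable tokens. Valid values: integers greater than or equal to -1 (-1 = no limit).
+ `DEFAULT_MAX_NEW_TOKENS`: the maximum number of tokens generated in a response (the maximum output tokens). Valid values: integers greater than or equal to 1.
+ `DEFAULT_LOGPROBS`: the number of log probabilities returned per token. Valid values: integers from 1 to 20.

**Configure the deployment**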
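
Because these values are passed to the container as strings, a range check before creating the model can catch mistakes early. The sketch below applies only the ranges listed above; `validate_generation_defaults` is a hypothetical helper, not part of any AWS SDK.

```python
def validate_generation_defaults(env):
    """Return a list of problems found in the optional DEFAULT_* settings."""
    problems = []

    def check(name, cast, in_range, rule):
        if name not in env:
            return  # optional parameter not set
        try:
            value = cast(env[name])
        except ValueError:
            problems.append(f"{name}: {env[name]!r} is not a valid {cast.__name__}")
            return
        if not in_range(value):
            problems.append(f"{name}: must be {rule}, got {value}")

    check("DEFAULT_TEMPERATURE", float, lambda v: 0.0 <= v <= 2.0, "between 0.0 and 2.0")
    check("DEFAULT_TOP_P", float, lambda v: 1e-10 <= v <= 1.0, "between 1e-10 and 1.0")
    check("DEFAULT_TOP_K", int, lambda v: v >= -1, "an integer >= -1")
    check("DEFAULT_MAX_NEW_TOKENS", int, lambda v: v >= 1, "an integer >= 1")
    check("DEFAULT_LOGPROBS", int, lambda v: 1 <= v <= 20, "an integer from 1 to 20")
    return problems


sample = {"DEFAULT_TEMPERATURE": "2.5", "DEFAULT_TOP_K": "-1", "DEFAULT_LOGPROBS": "0"}
for problem in validate_generation_defaults(sample):
    print(problem)
```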

```
# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Model Parameters
CONTEXT_LENGTH = "16000"       # Maximum total context length
MAX_CONCURRENCY = "2"          # Maximum concurrent sequences

# Optional: default generation parameters (comment out or uncomment as needed)
DEFAULT_TEMPERATURE = "0.0"   # Deterministic output
DEFAULT_TOP_P = "1.0"         # Consider all tokens
# DEFAULT_TOP_K = "50"        # Uncomment to limit to top 50 tokens
# DEFAULT_MAX_NEW_TOKENS = "2048"  # Uncomment to set max output tokens
# DEFAULT_LOGPROBS = "1"      # Uncomment to enable log probabilities

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
}

# Add optional parameters if defined
if 'DEFAULT_TEMPERATURE' in globals():
    environment['DEFAULT_TEMPERATURE'] = DEFAULT_TEMPERATURE
if 'DEFAULT_TOP_P' in globals():
    environment['DEFAULT_TOP_P'] = DEFAULT_TOP_P
if 'DEFAULT_TOP_K' in globals():
    environment['DEFAULT_TOP_K'] = DEFAULT_TOP_K
if 'DEFAULT_MAX_NEW_TOKENS' in globals():
    environment['DEFAULT_MAX_NEW_TOKENS'] = DEFAULT_MAX_NEW_TOKENS
if 'DEFAULT_LOGPROBS' in globals():
    environment['DEFAULT_LOGPROBS'] = DEFAULT_LOGPROBS

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")
```

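**Configure deployment-specific parameters**

Next, configure the parameters specific to your Amazon Nova model deployment, including the model artifact location and the instance type.

**Set the deployment identifier**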

```
# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"
```

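**Specify the model artifact location**

Provide the Amazon S3 URI of your trained Amazon Nova model artifacts. This should be the output location of your model training or fine-tuning job.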

```
# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
```

**Select the model variant and instance type**

```
# Configure model variant and instance type
TESTCASE = {
    "model": "lite2",              # Options: micro, lite, lite2
    "instance": "ml.p5.48xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")
```

**Naming convention**

The code generates consistent names for the AWS resources:
+ Model name: `{JOB_NAME}-{model}-{instance-type}`
+ Endpoint configuration: `{MODEL_NAME}-Config`
+ Endpoint name: `{MODEL_NAME}-Endpoint`
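
Generated names grow quickly once the suffixes are appended, so it can be worth validating them before calling the API. The sketch below rebuilds the names from the convention above and checks them against an assumed constraint of at most 63 alphanumeric-or-hyphen characters; `build_resource_names` is a hypothetical helper, and both the length limit and the pattern should be confirmed in the SageMaker API reference.

```python
import re

# Assumed SageMaker naming constraints; confirm in the API reference.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_NAME_LENGTH = 63


def build_resource_names(job_name, model, instance_type):
    """Apply the naming convention above and reject names that break the assumed rules."""
    model_name = f"{job_name}-{model}-{instance_type.replace('.', '-')}"
    names = {
        "model": model_name,
        "endpoint_config": model_name + "-Config",
        "endpoint": model_name + "-Endpoint",
    }
    for kind, name in names.items():
        if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
            raise ValueError(f"Invalid {kind} name: {name!r}")
    return names


names = build_resource_names("my-nova-deployment", "lite2", "ml.p5.48xlarge")
for kind, name in names.items():
    print(f"{kind}: {name}")
```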

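## Step 4: Create the SageMaker model and endpoint configuration
<a name="nova-sagemaker-inference-step4"></a>

In this step, you create two core resources: a SageMaker model object that references your Amazon Nova model artifacts, and an endpoint configuration that defines how the model is deployed.

**SageMaker model**: a model object that packages the inference container image, the model artifact location, and the environment configuration. It is reusable and can be deployed to multiple endpoints.

**Endpoint configuration**: defines the infrastructure settings for the deployment, including instance type, instance count, and model variants. It lets you manage deployment settings separately from the model itself.

**Create the SageMaker model**

The following code creates a SageMaker model that references your Amazon Nova model artifacts: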

```
try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")
```

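Key parameters:
+ `ModelName`: unique identifier for the model
+ `Image`: Docker container image URI for Amazon Nova inference
+ `ModelDataSource`: Amazon S3 location of the model artifacts
+ `Environment`: the environment variables configured in Step 3
+ `ExecutionRoleArn`: ARN of the IAM role from Step 2
+ `EnableNetworkIsolation`: set to True for stronger security (prevents the container from making outbound network calls)

**Create the endpoint configuration**

Next, create an endpoint configuration that defines the deployment infrastructure: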

```
# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")
```

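Key parameters:
+ `VariantName`: identifier for this model variant (use "primary" for single-model deployments)
+ `ModelName`: references the model created above
+ `InitialInstanceCount`: number of instances to deploy (start with 1 and scale later as needed)
+ `InstanceType`: the ML instance type selected in Step 3

**Verify resource creation**

You can verify that the resources were created successfully: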

```
# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")
```

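## Step 5: Deploy the endpoint
<a name="nova-sagemaker-inference-step5"></a>

Next, deploy your Amazon Nova model by creating a SageMaker real-time endpoint. The endpoint hosts your model and provides a secure HTTPS endpoint for inference requests.

Endpoint creation typically takes 15–30 minutes while AWS provisions infrastructure, downloads the model artifacts, and initializes the inference container.

**Create the endpoint**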

```
import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")
```

**Monitor endpoint creation**

The following code polls the endpoint status until deployment completes:

```
# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds
```
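
The loop above runs until the endpoint reaches a terminal state, so a stalled deployment would block it indefinitely. The following variant adds an overall timeout and raises on failure instead of printing. It is a sketch under stated assumptions: `wait_for_endpoint` is a hypothetical helper, the one-hour default is arbitrary, and the client is passed in so that the function can be exercised without AWS access.

```python
import time


def wait_for_endpoint(client, endpoint_name, timeout_seconds=3600,
                      poll_seconds=30, sleep=time.sleep):
    """Poll describe_endpoint until InService or Failed, or until the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(response.get("FailureReason", "Unknown failure"))
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{endpoint_name} is still {status} after {timeout_seconds}s"
            )
        sleep(poll_seconds)


# Usage with the client and name defined in earlier steps:
# wait_for_endpoint(sagemaker, ENDPOINT_NAME)
```

Boto3 also offers a built-in alternative, `sagemaker.get_waiter('endpoint_in_service')`, which may be preferable when custom status messages are not needed.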

**Verify the endpoint is ready**

After the endpoint status changes to InService, you can verify its configuration:

```
# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break
```

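**Troubleshoot endpoint creation failures**

Common causes of failure:
+ **Insufficient capacity**: the requested instance type is not currently available in your Region
  + Solution: try a different instance type, or request a quota increase
+ **IAM permissions**: the execution role lacks required permissions
  + Solution: verify that the role can access the Amazon S3 model artifacts and has the required SageMaker permissions
+ **Model artifacts not found**: the Amazon S3 URI is wrong or inaccessible
  + Solution: verify the Amazon S3 URI, check bucket permissions, and confirm you are working in the correct Region
+ **Resource limits exceeded**: the account's endpoint or instance count exceeds service limits
  + Solution: request a quota increase through Service Quotas or AWS Support

**Note**  
To delete a failed endpoint before redeploying:  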

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

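## Step 6: Invoke the endpoint
<a name="nova-sagemaker-inference-step6"></a>

After the endpoint status changes to InService, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports two types of endpoint invocation: synchronous endpoints (real-time invocation in streaming or non-streaming mode) and asynchronous endpoints (Amazon S3-based batch processing).

**Set up the runtime client**

Create a SageMaker Runtime client with suitable timeout settings: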

```
import json
import boto3
import botocore
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = botocore.config.Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)
```

**Write a general-purpose inference function**

The following function handles both streaming and non-streaming requests:

```
def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Body=body
            )
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            response = runtime_client.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Accept='application/json',
                Body=body
            )
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")
```

**Example 1: Non-streaming chat completion**

Use the chat format for multi-turn interactions:

```
# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "reasoning_effort": "low",  # Options: "low", "high"
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)
```

**Example response:**

```
{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}
```

**Example 2: Simple text completion**

Use the completion format for basic text generation:

```
# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)
```

**Example response:**

```
{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
```

**Example 3: Streaming chat completion**

```
# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "reasoning_effort": "high",  # For more detailed reasoning
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)
```

**Example streaming output:**

```
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
```
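
Rather than printing raw chunks, a client usually wants the generated text reassembled. The sketch below joins the `delta.content` fields from the `data:` events shown above and stops at `data: [DONE]`. `collect_stream_text` is a hypothetical helper, and it assumes that each event ends with a newline, as in server-sent events; a payload part may end in the middle of an event, so incomplete lines are buffered until the rest arrives.

```python
import json


def collect_stream_text(raw_chunks):
    """Join delta content from 'data: {...}' lines; return (text, usage)."""
    pieces, usage, buffer = [], None, ""
    # The trailing newline flushes a final line that has no terminator.
    for chunk in list(raw_chunks) + ["\n"]:
        buffer += chunk
        # Parse complete lines only; keep any partial tail for the next chunk.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            line = line.strip()
            if not line.startswith("data:"):
                continue
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return "".join(pieces), usage
            event = json.loads(payload)
            usage = event.get("usage") or usage
            for choice in event.get("choices", []):
                content = choice.get("delta", {}).get("content")
                if content:
                    pieces.append(content)
    return "".join(pieces), usage


demo_chunks = [
    'data: {"choices":[{"delta":{"content":" Once"}}]}\n',
    'data: {"choices":[{"delta":{"con',          # event split across two parts
    'tent":" upon"}}]}\n',
    'data: {"choices":[{"delta":{}}],"usage":{"total_tokens":102}}\n',
    "data: [DONE]",
]
text, usage = collect_stream_text(demo_chunks)
print(repr(text), usage)
```

To use it with the streaming call, pass the decoded `PayloadPart` bytes from the event stream as the chunks.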

**Example 4: Multimodal chat completion**

Use the multimodal format for mixed image and text input:

```
# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)
```
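
The `url` field in the request above is a base64 data URL with the image data elided. A minimal way to build one from raw image bytes is sketched below. `image_content_part` and `build_multimodal_message` are hypothetical helpers, and which image formats and sizes your model accepts is not covered here.

```python
import base64


def image_content_part(image_bytes, mime_type="image/jpeg"):
    """Wrap raw image bytes as an image_url content part with a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
    }


def build_multimodal_message(text, image_bytes, mime_type="image/jpeg"):
    """Build a user message that combines a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            image_content_part(image_bytes, mime_type),
        ],
    }


# In practice, read the bytes from a file: open("photo.jpg", "rb").read()
message = build_multimodal_message("What's in this image?", b"abc")
print(message["content"][1]["image_url"]["url"])
```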

**Example response:**

```
{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}
```

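## Step 7: Clean up resources (optional)
<a name="nova-sagemaker-inference-step7"></a>

To avoid unnecessary charges, delete the AWS resources created in this tutorial. SageMaker endpoints incur charges for as long as they are running, even when you are not sending inference requests.

**Important**  
Deleting resources is permanent and cannot be undone. Confirm that you no longer need these resources before you continue.

**Delete the endpoint**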

```
import boto3

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")
```

**Note**  
Endpoint deletion is asynchronous. You can monitor the deletion status:  

```
import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break
```

**Delete the endpoint configuration**

After the endpoint is deleted, delete the endpoint configuration:

```
try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")
```

**Delete the model**

Delete the SageMaker model object:

```
try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
```

# API reference
<a name="nova-sagemaker-inference-api-reference"></a>

Amazon Nova models on SageMaker use the standard SageMaker Runtime API for inference. For complete API documentation, see [Test a deployed model](https://docs.aws.amazon.com//sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

## Endpoint invocation
<a name="nova-sagemaker-inference-api-invocation"></a>

Amazon Nova models on SageMaker support two invocation methods:
+ **Synchronous invocation**: use the [InvokeEndpoint API](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) for real-time, non-streaming inference requests.
+ **Streaming invocation**: use the [InvokeEndpointWithResponseStream API](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html) for real-time streaming inference requests.

## Request format
<a name="nova-sagemaker-inference-api-request"></a>

Amazon Nova models support three request formats:

**Chat completion format**

Use this format for conversational interactions:

```
{
  "messages": [
    {"role": "user", "content": "string"}
  ],
  "max_tokens": integer,
  "max_completion_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": boolean,
  "top_logprobs": integer,
  "reasoning_effort": "low" | "high",
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Text completion format**

Use this format for simple text generation:

```
{
  "prompt": "string",
  "max_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": integer,
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Multimodal chat completion format**

Use this format for mixed image and text input:

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": integer,
  "temperature": float,
  "top_p": float,
  "stream": boolean
}
```

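**Request parameters**
+ `messages` (array): used in the chat completion format. An array of message objects, each with `role` and `content` fields. A string `content` is plain text input; an array `content` is multimodal input.
+ `prompt` (string): used in the text completion format. The input text to generate from.
+ `max_tokens` (integer): maximum number of tokens to generate in the response. Valid values: ≥ 1.
+ `max_completion_tokens` (integer): alternative to `max_tokens` for chat completion. Maximum number of completion tokens to generate.
+ `temperature` (float): controls the randomness of generated output. Valid values: 0.0 to 2.0 (0.0 = deterministic, 2.0 = maximum randomness).
+ `top_p` (float): nucleus sampling threshold. Valid values: 1e-10 to 1.0.
+ `top_k` (integer): limits token selection to the K most probable tokens. Valid values: greater than or equal to -1 (-1 = no limit).
+ `stream` (boolean): whether to stream the response. `true` for streaming, `false` for non-streaming.
+ `logprobs` (boolean/integer): a boolean for chat completion. For text completion, an integer giving the number of log probabilities to return. Valid values: 1 to 20.
+ `top_logprobs` (integer): number of most probable tokens to return log probabilities for (chat completion only).
+ `reasoning_effort` (string): reasoning effort level. Options: low, high (chat completion with Nova 2 Lite custom models only).
+ `allowed_token_ids` (array): list of token IDs that may be generated. Restricts output to the specified tokens.
+ `truncate_prompt_tokens` (integer): truncates the prompt to this many tokens if it exceeds the limit.
+ `stream_options` (object): options for streaming responses. Contains the boolean `include_usage`, which includes token usage information in the streamed response.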
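
The `logprobs` parameter changes type between the two formats, which is an easy mistake to make when switching from one to the other. The following sketch checks a request against the rules listed above; `check_logprobs` is a hypothetical helper, and it detects the chat format simply by the presence of `messages`.

```python
def check_logprobs(request):
    """Return an error string, or None if the logprobs fields fit the request format."""
    is_chat = "messages" in request
    value = request.get("logprobs")
    if not is_chat and "top_logprobs" in request:
        return "top_logprobs applies to chat completion only"
    if value is None:
        return None
    if is_chat:
        if not isinstance(value, bool):
            return "chat completion: logprobs must be a boolean"
        return None
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        return "text completion: logprobs must be an integer"
    if not 1 <= value <= 20:
        return "text completion: logprobs must be between 1 and 20"
    return None


requests_to_check = [
    {"messages": [{"role": "user", "content": "Hi"}], "logprobs": True, "top_logprobs": 3},
    {"prompt": "The capital of France is", "logprobs": 3},
    {"prompt": "The capital of France is", "logprobs": True},
]
for request in requests_to_check:
    print(check_logprobs(request) or "ok")
```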

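## Response format
<a name="nova-sagemaker-inference-api-response"></a>

The response format depends on the invocation method and the request type:

**Chat completion response (non-streaming)**

For synchronous chat completion requests: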

```
{
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": [9906, 0, 358, 2157, 1049, 11, 1309, 345, 369, 6464, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "prompt_token_ids": [9906, 0, 358]
}
```

**Text completion response (non-streaming)**

Returned for synchronous text completion requests:

```
{
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris, the capital and most populous city of France.",
      "logprobs": {
        "tokens": ["Paris", ",", " the", " capital"],
        "token_logprobs": [-0.31725305, -0.07918124, -0.12345678, -0.23456789],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403,
            "Rome": -2.1234567
          },
          {
            ",": -0.07918124,
            " is": -1.2345678
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_token_ids": [464, 6864, 315, 4881, 374],
      "token_ids": [3915, 11, 279, 6864, 323, 1455, 95551, 3363, 315, 4881, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}
```

**Chat completion streaming response**

Returned for streaming chat completion requests. The response arrives as server-sent events (SSE):

```
data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Hello",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              }
            ]
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null,
  "prompt_token_ids": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "! I'm"
      },
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}

data: [DONE]
```
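
The following sketch shows one way a client might reassemble text from such a stream. It assumes each event arrives as a single `data:` line containing compact JSON (the example above is pretty-printed for readability) and that the caller has already split the stream into lines. `collect_stream_text` is a hypothetical helper; it reads `delta.content` for chat chunks and `text` for text completion chunks.

```python
import json


def collect_stream_text(lines):
    """Accumulate generated text and the final usage block from SSE lines."""
    parts = []
    usage = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or choice.get("text") or "")
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(parts), usage


events = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Hello"}}], "usage": null}',
    '',
    'data: {"choices": [{"index": 0, "delta": {"content": "! I\'m"}}], "usage": null}',
    '',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}], "usage": {"total_tokens": 21}}',
    '',
    'data: [DONE]',
]
text, usage = collect_stream_text(events)
print(text)  # Hello! I'm
```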

**Text completion streaming response**

Returned for streaming text completion requests:

```
data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris",
      "logprobs": {
        "tokens": ["Paris"],
        "token_logprobs": [-0.31725305],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": ", the capital",
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "",
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16
  }
}

data: [DONE]
```

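**Response fields**
+ `id`: Unique identifier of the completion
+ `object`: Type of the returned object (`chat.completion`, `text_completion`, or `chat.completion.chunk`)
+ `created`: Unix timestamp of when the completion was generated
+ `model`: Model used to generate the completion
+ `choices`: Array of completion results
+ `usage`: Token usage information, including prompt tokens, completion tokens, and total tokens
+ `logprobs`: Log probability information for tokens (returned only when requested)
+ `finish_reason`: Reason the model stopped generating (`stop`, `length`, or `content_filter`)
+ `delta`: Incremental content in a streaming response
+ `reasoning`: Reasoning content, present when `reasoning_effort` is used
+ `token_ids`: Array of token IDs for the generated text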
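
As an illustration of how these fields fit together, the sketch below pulls the generated text, finish reason, and token count out of a non-streaming response. `summarize_response` is a hypothetical helper, and the sample dictionary is a trimmed copy of the chat completion example shown earlier.

```python
def summarize_response(response: dict) -> dict:
    """Extract the most commonly used fields from a non-streaming response."""
    choice = response["choices"][0]
    if response.get("object") == "chat.completion":
        text = choice["message"]["content"]   # chat completion
    else:
        text = choice["text"]                 # text completion
    usage = response.get("usage") or {}
    finish_reason = choice.get("finish_reason")
    return {
        "text": text,
        "finish_reason": finish_reason,
        # "length" means generation stopped at the token limit.
        "truncated": finish_reason == "length",
        "total_tokens": usage.get("total_tokens"),
    }


sample = {
    "object": "chat.completion",
    "choices": [
        {"index": 0,
         "message": {"role": "assistant", "content": "Hello!"},
         "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 9, "completion_tokens": 12, "total_tokens": 21},
}
summary = summarize_response(sample)
print(summary["text"], summary["total_tokens"])  # Hello! 21
```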

For the complete API documentation, see the [InvokeEndpoint API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) and the [InvokeEndpointWithResponseStream API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html).
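
A non-streaming call through `InvokeEndpoint` might look like the following sketch. It takes an already-created `sagemaker-runtime` client as an argument so that the function itself has no AWS dependency. The `application/json` content type, the helper name `invoke_chat`, and the endpoint name in the usage comment are assumptions.

```python
import json


def invoke_chat(runtime_client, endpoint_name: str, payload: dict) -> dict:
    """Send a JSON payload to a SageMaker endpoint and decode the JSON reply.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # assumed content type
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully, then parse the JSON.
    return json.loads(response["Body"].read())


# Usage (requires AWS credentials and a deployed endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")
#   result = invoke_chat(runtime, "my-nova-endpoint",
#                        {"messages": [{"role": "user", "content": "Hi"}],
#                         "max_tokens": 64})
```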

# Evaluating models hosted on SageMaker Inference
<a name="nova-eval-on-sagemaker-inference"></a>

This guide explains how to use the open-source evaluation framework [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai) to evaluate custom Amazon Nova models deployed on SageMaker inference endpoints.

**Note**  
For a hands-on walkthrough, see the [SageMaker Inspect AI quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb).

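## Overview
<a name="nova-eval-sagemaker-overview"></a>

You can evaluate custom Amazon Nova models deployed on SageMaker endpoints against standardized benchmarks from the AI research community. This approach lets you:
+ Evaluate custom Amazon Nova models (fine-tuned, distilled, or otherwise adapted) at scale
+ Run evaluations with parallel inference across multiple endpoint instances
+ Compare model performance on benchmarks such as MMLU, TruthfulQA, and HumanEval
+ Integrate with your existing SageMaker infrastructure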

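## Supported models
<a name="nova-eval-sagemaker-supported-models"></a>

The SageMaker inference provider supports the following models and endpoint types:
+ Amazon Nova models (Nova Micro, Nova Lite, Nova 2 Lite)
+ Models deployed through vLLM or another OpenAI-compatible inference server
+ Any endpoint that supports the OpenAI Chat Completions API format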

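## Prerequisites
<a name="nova-eval-sagemaker-prerequisites"></a>

Before you begin, make sure that you have the following:
+ An AWS account with permissions to create and invoke SageMaker endpoints
+ AWS credentials configured through the AWS CLI, environment variables, or an IAM role
+ Python 3.9 or later

**Required IAM permissions**

Your IAM user or role needs the following permissions: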

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:InvokeEndpoint",
        "sagemaker:DescribeEndpoint"
      ],
      "Resource": "arn:aws:sagemaker:*:*:endpoint/*"
    }
  ]
}
```

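## Step 1: Deploy a SageMaker endpoint
<a name="nova-eval-sagemaker-step1"></a>

Before you run an evaluation, deploy a SageMaker inference endpoint that hosts the model you want to evaluate.

For instructions on creating a SageMaker inference endpoint with an Amazon Nova model, see [Getting started](nova-sagemaker-inference-getting-started.md).

After the endpoint status changes to `InService`, note the endpoint name so that you can use it in the evaluation commands.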
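
Waiting for the `InService` status can be scripted. The sketch below polls `describe_endpoint` on a client passed in by the caller. The helper name `wait_until_in_service`, the polling interval, and the attempt limit are assumptions. boto3 also ships a built-in `endpoint_in_service` waiter that serves the same purpose.

```python
import time


def wait_until_in_service(sagemaker_client, endpoint_name: str,
                          poll_seconds: int = 30, max_polls: int = 60,
                          sleep=time.sleep) -> str:
    """Poll the endpoint status until it is InService, or raise on failure.

    sagemaker_client is expected to behave like boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        sleep(poll_seconds)  # still Creating or Updating; wait and retry
    raise TimeoutError(f"Endpoint {endpoint_name} not InService after {max_polls} polls")


# Usage (requires AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker", region_name="us-west-2")
#   wait_until_in_service(sm, "my-nova-endpoint")
```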

## Step 2: Install evaluation dependencies
<a name="nova-eval-sagemaker-step2"></a>

Create a Python virtual environment and install the required packages.

```
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install uv for faster package installation
pip install uv

# Install Inspect AI and evaluation benchmarks
uv pip install inspect-ai inspect-evals

# Install AWS dependencies
uv pip install aioboto3 boto3 botocore openai
```

## Step 3: Configure AWS credentials
<a name="nova-eval-sagemaker-step3"></a>

Choose one of the following authentication methods:

**Option 1: AWS CLI (recommended)**

```
aws configure
```

When prompted, enter your AWS access key ID, secret access key, and default Region name.

**Option 2: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
```

**Option 3: IAM role**

If you run on Amazon EC2 or in a SageMaker notebook, the instance's IAM role is used automatically.

**Verify your credentials**

```
import boto3

sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Account: {identity['Account']}")
print(f"User/Role: {identity['Arn']}")
```

## Step 4: Install the SageMaker provider
<a name="nova-eval-sagemaker-step4"></a>

The SageMaker provider lets Inspect AI communicate with your SageMaker endpoint. The [quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb) streamlines the provider installation.

## Step 5: Download evaluation benchmarks
<a name="nova-eval-sagemaker-step5"></a>

Clone the Inspect Evals repository to get the standard evaluation benchmarks:

```
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
```

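The benchmarks in this repository include:
+ MMLU and MMLU-Pro (knowledge and reasoning)
+ TruthfulQA (truthfulness)
+ HumanEval (code generation)
+ GSM8K (mathematical reasoning)

## Step 6: Run an evaluation
<a name="nova-eval-sagemaker-step6"></a>

Run an evaluation against your SageMaker endpoint: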

```
cd inspect_evals/src/inspect_evals/

inspect eval mmlu_pro/mmlu_pro.py \
  --model sagemaker/my-nova-endpoint \
  -M region_name=us-west-2 \
  --max-connections 256 \
  --max-retries 100 \
  --display plain
```

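**Key parameters**


| Parameter | Default | Description | 
| --- | --- | --- | 
| `--max-connections` | 10 | Number of parallel requests sent to the endpoint. Scale it with the instance count (for example, 10 instances × 25 = 250). | 
| `--max-retries` | 3 | Number of retries for failed requests. For large evaluations, 50 to 100 is recommended. | 
| `-M region_name` | us-east-1 | AWS Region where the endpoint is deployed. | 
| `-M read_timeout` | 600 | Request timeout, in seconds. | 
| `-M connect_timeout` | 60 | Connection timeout, in seconds. | 

**Tuning tips**

For multi-instance endpoints: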

```
# 10-instance endpoint example
--max-connections 250   # ~25 connections per instance
--max-retries 100       # Handle transient errors
```

Setting `--max-connections` too high can overload the endpoint and trigger throttling; setting it too low leaves capacity underused.
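
The sizing rule of thumb above (about 25 connections per instance) reduces to a one-line calculation. The helper below is a hypothetical convenience, and 25 is only the starting point suggested by the example; adjust it based on the throttling you observe.

```python
def suggest_max_connections(instance_count: int, per_instance: int = 25) -> int:
    """Suggest a --max-connections value from the endpoint's instance count."""
    if instance_count < 1 or per_instance < 1:
        raise ValueError("instance_count and per_instance must be positive")
    return instance_count * per_instance


print(suggest_max_connections(10))  # 250
```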

## Step 7: View results
<a name="nova-eval-sagemaker-step7"></a>

Start the Inspect AI viewer to analyze the evaluation results:

```
inspect view
```

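The viewer shows:
+ Overall scores and metrics
+ Per-sample results, including model responses
+ Error analysis and failure patterns

## Managing endpoints
<a name="nova-eval-sagemaker-managing-endpoints"></a>

**Update an endpoint**

To update an existing endpoint with a new model or configuration: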

```
import boto3

sagemaker = boto3.client('sagemaker', region_name=REGION)

# Create new model and endpoint configuration
# Then update the endpoint
sagemaker.update_endpoint(
    EndpointName=EXISTING_ENDPOINT_NAME,
    EndpointConfigName=NEW_ENDPOINT_CONFIG_NAME
)
```

**Delete an endpoint**

```
# Reuses the `sagemaker` client created in the update example above
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

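## Adding custom benchmarks
<a name="nova-eval-sagemaker-custom-benchmarks"></a>

You can add a new benchmark to Inspect AI with the following workflow:

1. Study the benchmark's dataset format and evaluation metrics

1. Review similar implementations in `inspect_evals/`

1. Create a task file that converts dataset records into Inspect AI samples

1. Implement the corresponding solver and scorer

1. Validate with a small test run

Example task structure: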

```
from inspect_ai import Task, task
from inspect_ai.dataset import hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def my_benchmark():
    return Task(
        dataset=hf_dataset("dataset_name", split="test"),
        solver=multiple_choice(),
        scorer=choice()
    )
```
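
The third workflow step, converting dataset records into samples, could be sketched as follows. The record keys (`question`, `options`, `answer_index`, `id`) are hypothetical and depend on your dataset. The function returns plain keyword arguments so that it runs without Inspect AI installed. With Inspect AI available, the result could be passed to `Sample(**fields)` from `inspect_ai.dataset`, and a function that returns that `Sample` could be supplied through the `sample_fields` argument of `hf_dataset`.

```python
def record_to_sample_fields(record: dict) -> dict:
    """Map one raw multiple-choice record to Inspect AI Sample keyword arguments.

    The input keys are assumptions about the dataset layout.
    """
    letters = "ABCDEFGHIJ"
    return {
        "input": record["question"],                  # prompt shown to the model
        "choices": list(record["options"]),           # answer options, in order
        "target": letters[record["answer_index"]],    # correct option as a letter
        "id": record.get("id"),
    }


fields = record_to_sample_fields({
    "id": 7,
    "question": "What is the capital of France?",
    "options": ["Rome", "Paris", "London"],
    "answer_index": 1,
})
print(fields["target"])  # B
```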

## Troubleshooting
<a name="nova-eval-sagemaker-troubleshooting"></a>

**Common issues**

**Endpoint throttling or timeouts**
+ Lower `--max-connections`
+ Raise `--max-retries`
+ Check the endpoint's CloudWatch metrics for capacity problems

**Authentication errors**
+ Confirm that your AWS credentials are configured correctly
+ Check that your IAM permissions include `sagemaker:InvokeEndpoint`

**Model errors**
+ Confirm that the endpoint is in the `InService` state
+ Check that the model supports the OpenAI Chat Completions API format

## Related resources
<a name="nova-eval-sagemaker-related-resources"></a>
+ [Inspect AI documentation](https://inspect.ai-safety-institute.org.uk/)
+ [Inspect Evals repository](https://github.com/UKGovernmentBEIS/inspect_evals)
+ [SageMaker Developer Guide](https://docs.aws.amazon.com//sagemaker/latest/dg/whatis.html)
+ [Deploy models for inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html)
+ [Configuring the AWS CLI](https://docs.aws.amazon.com//cli/latest/userguide/cli-chap-configure.html)

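# Abuse detection for Amazon Nova Forge models deployed in Amazon SageMaker Inference
<a name="nova-sagemaker-inference-abuse-detection"></a>

AWS is committed to the responsible use of AI. To help prevent potential misuse, when you deploy Amazon Nova Forge models in Amazon SageMaker Inference, SageMaker Inference applies automated abuse detection mechanisms to identify potential violations of the AWS [Acceptable Use Policy](https://aws.amazon.com/aup/) (AUP) and the Service Terms, including the [Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/).

Our abuse detection mechanisms are fully automated, so there is no human review of, or access to, user inputs or model outputs.

Automated abuse detection includes:
+ **Categorizing content**: We use classifiers to detect harmful content (such as content that incites violence) in user inputs and model outputs. A classifier is an algorithm that processes model inputs and outputs and assigns a type of harm and a confidence level. We may run these classifiers on Amazon Nova Forge model usage. The classification process is automated and does not involve human review of user inputs or model outputs.
+ **Identifying patterns**: We use classifier metrics to identify potential violations and recurring behavior. We may compile and share anonymized classifier metrics. Amazon SageMaker Inference does not store user inputs or model outputs.
+ **Detecting and blocking child sexual abuse material (CSAM)**: You are responsible for the content that you (and your end users) upload to Amazon SageMaker Inference and must make sure that this content is free from illegal images. To help stop the dissemination of CSAM, when you deploy Amazon Nova Forge models in Amazon SageMaker Inference, SageMaker Inference may use automated abuse detection mechanisms (such as hash matching technology or classifiers) to detect apparent CSAM. If Amazon SageMaker Inference detects apparent CSAM in your image inputs, it blocks the request and you receive an automated error message. Amazon SageMaker Inference may also file a report with the National Center for Missing and Exploited Children (NCMEC) or a relevant authority. We take CSAM seriously and will continue to update our detection, blocking, and reporting mechanisms. You might be required by applicable laws to take additional actions, and you are responsible for those actions.

After our automated abuse detection mechanisms identify potential violations, we may ask you to explain how you use Amazon SageMaker Inference and how you comply with our terms of service. If you do not respond, or are unwilling or unable to comply with these terms or policies, AWS may suspend your access to Amazon SageMaker Inference. You may also be billed for failed inference jobs if our automated tests detect that model responses do not comply with our terms and policies.

If you have additional questions, contact AWS Support. For more information, see the [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/ai/faqs/).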