Getting started

This guide shows how to deploy a customized Amazon Nova model to a SageMaker real-time endpoint, configure inference parameters, and invoke the model for testing.

Prerequisites

The following are the prerequisites for deploying Amazon Nova models to SageMaker inference.

An AWS account - If you don't have one yet, see "Creating an AWS account".
Required IAM permissions - Make sure that the following managed policies are attached to your IAM user or role:
- AmazonSageMakerFullAccess
- AmazonS3FullAccess
Required SDK/CLI versions - The following SDK versions have been tested and validated with Amazon Nova models on SageMaker inference:
- SageMaker Python SDK v3.0.0+ (sagemaker>=3.0.0) for the resource-based API approach
- Boto3 version 1.35.0+ (boto3>=1.35.0) for direct API calls. The examples in this guide use this approach.
Service quota increase - Request an Amazon SageMaker service quota increase for the ML instance type that you plan to use for your SageMaker inference endpoint (for example, ml.p5.48xlarge for endpoint usage). For supported instance types, see "Supported models and instances". To request an increase, see "Requesting a quota increase". For information about SageMaker instance quotas, see "SageMaker endpoints and quotas".
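
Before you continue, it can help to confirm that the installed SDK versions meet the minimums listed above. The following is a minimal sketch, not part of the official tooling: the helper names (parse_version, meets_minimum, check_installed) are hypothetical, and the version parsing is deliberately simple (numeric dotted versions only).

```python
from importlib import metadata

# Minimum versions quoted in the prerequisites above.
MINIMUM_VERSIONS = {"boto3": "1.35.0", "sagemaker": "3.0.0"}


def parse_version(text):
    """Turn '1.35.12' into (1, 35, 12); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:  # pad so that '1.36' compares like '1.36.0'
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_installed(package, minimum):
    """Return (installed_version_or_None, ok_flag) for one package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, meets_minimum(installed, minimum)


if __name__ == "__main__":
    for name, minimum in MINIMUM_VERSIONS.items():
        installed, ok = check_installed(name, minimum)
        print(f"{name}: installed={installed}, required>={minimum}, ok={ok}")
```

The sagemaker package is only needed if you use the resource-based approach, so a missing sagemaker entry is not a problem for the Boto3 examples in this guide.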

Tip

For a quick end-to-end deployment, you can run the custom Nova model SageMaker inference notebook, which deploys a customized Amazon Nova model to SageMaker inference from within a single notebook.

Step 1: Configure AWS credentials

Configure your AWS credentials using one of the following methods.

Option 1: AWS CLI (recommended)


aws configure

When prompted, enter your AWS access key, secret key, and default Region.

Option 2: AWS credentials file

Create or edit ~/.aws/credentials.


[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

Option 3: Environment variables


export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

Note

For more information about AWS credentials, see "Configuration and credential file settings".
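
If you chose Option 3, a quick check that both variables are actually set in the current process can save a confusing authentication failure later. This is a small sketch with a hypothetical helper name (missing_credential_vars). It only inspects environment variables, so with Option 1 or Option 2 it is expected to report both names as missing even though your credentials are fine.

```python
import os

# The two variables exported in Option 3 above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credential_vars(env=None):
    """Return the names of credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credential_vars()
    if missing:
        print("Not set in this shell:", ", ".join(missing))
    else:
        print("Both credential environment variables are set.")
```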

Initialize the AWS clients

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials.


import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts', region_name=REGION)

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")

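If authentication succeeds, the output confirms your AWS account ID.

Step 2: Create a SageMaker execution role

A SageMaker execution role is an IAM role that grants SageMaker permission to access AWS resources on your behalf, such as the Amazon S3 bucket that holds your model artifacts and CloudWatch for logging.

Create the execution role

Note

Creating an IAM role requires the iam:CreateRole and iam:AttachRolePolicy permissions. Before you proceed, make sure that your IAM user or role has these permissions.

The following code creates an IAM role with the permissions required to deploy Amazon Nova customized models.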


import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role
role_response = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description='SageMaker execution role with S3 and SageMaker access'
)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")
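
The code above fails if it is run a second time, because the role name is already taken. One way to make the step repeatable is sketched below. The function name get_or_create_role is hypothetical. It relies on the Boto3 IAM client's create_role and get_role calls and its EntityAlreadyExistsException error class, and takes the client as an argument so that it can be exercised without touching a real account.

```python
import json


def get_or_create_role(iam, role_name, trust_policy):
    """Create the role, or return the existing role's ARN if the name is taken."""
    try:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(trust_policy),
        )
    except iam.exceptions.EntityAlreadyExistsException:
        # The role is already there (for example, from an earlier run): reuse it.
        response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]
```

With the variables from the code above, a call might look like SAGEMAKER_EXECUTION_ROLE_ARN = get_or_create_role(iam, role_name, trust_policy), followed by the same attach_role_policy calls.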

Use an existing execution role (optional)

If you already have a SageMaker execution role, you can use it instead.


# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"

To find existing SageMaker roles in your account:


iam = boto3.client('iam', region_name=REGION)
# list_roles is paginated, so walk every page instead of reading only the first one
paginator = iam.get_paginator('list_roles')
for page in paginator.paginate():
    for role in page['Roles']:
        if 'SageMaker' in role['RoleName']:
            print(f"{role['RoleName']}: {role['Arn']}")

Important

The execution role must have a trust relationship with sagemaker.amazonaws.com, and it must have permissions to access Amazon S3 and SageMaker resources.

For more information about SageMaker execution roles, see "SageMaker roles".
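
When you reuse an existing role, the trust relationship is the requirement that is easiest to overlook. The sketch below checks a trust policy document for an Allow statement that lets sagemaker.amazonaws.com call sts:AssumeRole. The function name trusts_sagemaker is hypothetical, and the check is intentionally narrow: it expects the policy as a Python dict and does not evaluate conditions or wildcard principals.

```python
def trusts_sagemaker(trust_policy, service="sagemaker.amazonaws.com"):
    """Return True if an Allow statement lets the service assume the role."""
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = statement.get("Principal")
        if not isinstance(principal, dict):
            continue  # for example "*": not handled by this simple check
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        if service in services:
            return True
    return False
```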

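Step 3: Configure model parameters

Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For the list of supported instance types and the CONTEXT_LENGTH and MAX_CONCURRENCY values supported on each, see "Supported models and instances". For the full list of additional container features, such as sampling defaults, speculative decoding, and quantization, see "Inference container features".

Required parameters

IMAGE: The Docker container image URI for the Amazon Nova inference container. This is provided by AWS.
CONTEXT_LENGTH: The model context length.
MAX_CONCURRENCY: The maximum number of sequences per iteration. This limits the number of individual user requests (prompts) that can be processed concurrently within a single batch on the GPU. Range: an integer greater than 0.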
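
Because the container receives these settings as environment variables, they are passed as strings, and a typo only surfaces once the deployment is under way. A small validation step, sketched below, can catch that earlier. The helper name build_environment is hypothetical. The "greater than 0" rule for MAX_CONCURRENCY comes from the list above; applying the same rule to CONTEXT_LENGTH is an assumption, and the per-instance upper limits are not checked here.

```python
def build_environment(context_length, max_concurrency, extra=None):
    """Validate the two required settings and return them as string env vars."""
    values = {"CONTEXT_LENGTH": context_length, "MAX_CONCURRENCY": max_concurrency}
    environment = {}
    for name, value in values.items():
        number = int(value)  # raises ValueError for non-numeric input
        if number <= 0:
            raise ValueError(f"{name} must be an integer greater than 0, got {value!r}")
        environment[name] = str(number)
    # Optional container feature variables are passed through as strings.
    for name, value in (extra or {}).items():
        environment[name] = str(value)
    return environment
```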

Configure the deployment


# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Required parameters
CONTEXT_LENGTH = "8000"        # Maximum total context length
MAX_CONCURRENCY = "8"          # Maximum concurrent sequences

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
    # Optional: add container feature environment variables here.
    # See "Inference Container Features" for the full list.
    # Examples:
    # 'DEFAULT_TEMPERATURE': '0.7',
    # 'DEFAULT_MAX_NEW_TOKENS': '512',
    # 'QUANTIZATION_DTYPE': 'fp8',
}

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")

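Configure deployment-specific parameters

Next, configure the parameters specific to your Amazon Nova model deployment, including the model artifact location and the instance type.

Set the deployment identifier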


# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"

Specify the model artifact location

Provide the Amazon S3 URI where your trained Amazon Nova model artifacts are stored. This must be the output location of your model training or fine-tuning job.


# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
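
The comment above notes that the URI must end with a slash. A small guard such as the following sketch can enforce that before any resources are created. The helper name normalize_model_s3_uri is hypothetical. It checks only the shape of the URI, not whether the bucket or prefix exists.

```python
from urllib.parse import urlparse


def normalize_model_s3_uri(uri):
    """Check that the value is an s3:// URI and make sure it ends with '/'."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Expected an s3://bucket/prefix/ URI, got {uri!r}")
    if not uri.endswith("/"):
        uri += "/"
    return uri
```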

Select the model variant and instance type


# Configure model variant and instance type
TESTCASE = {
    "model": "micro",              # Options: micro, lite, lite2
    "instance": "ml.g5.12xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")

Naming convention

The code automatically generates consistent names for your AWS resources:

Model name: {JOB_NAME}-{model}-{instance-type}
Endpoint configuration: {MODEL_NAME}-Config
Endpoint name: {MODEL_NAME}-Endpoint
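
Because every name is derived from JOB_NAME, a long job name can push the derived names past what SageMaker accepts. The sketch below rebuilds the names with the same convention and checks them. The helper name build_resource_names is hypothetical, and the limits it enforces (at most 63 characters, letters, digits, and hyphens only, no leading or trailing hyphen) are stated here as an assumption about SageMaker naming rules that you should confirm against the API reference.

```python
import re

# Assumed naming rule: alphanumerics and hyphens, no hyphen at either end.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9]([a-zA-Z0-9-]*[a-zA-Z0-9])?$")
MAX_NAME_LENGTH = 63  # assumed limit


def build_resource_names(job_name, model, instance_type):
    """Derive the resource names used in this guide and validate each one."""
    model_name = f"{job_name}-{model}-{instance_type.replace('.', '-')}"
    names = {
        "model": model_name,
        "endpoint_config": model_name + "-Config",
        "endpoint": model_name + "-Endpoint",
        "inference_component": model_name + "-IC",
    }
    for kind, name in names.items():
        if len(name) > MAX_NAME_LENGTH or not NAME_PATTERN.match(name):
            raise ValueError(f"Invalid {kind} name ({len(name)} characters): {name}")
    return names
```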

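Step 4: Create SageMaker resources and deploy the endpoint

SageMaker provides two approaches for deploying a model to a real-time endpoint. Choose the approach that fits your use case.

Inference components (recommended): Deploy the model to the endpoint as an inference component. With this approach you can host multiple models on a single endpoint, scale each model independently, and optimize resource utilization.
Single model endpoint: Deploy a single model directly to an endpoint using a model object and an endpoint configuration. This approach is simpler to set up and suits development, testing, or workloads that need only one model per endpoint.

Option A: Create using inference components

With inference components, you first create the endpoint and then deploy the model to it as an inference component. This decouples the model from the endpoint infrastructure and gives you more flexibility.

Create the endpoint configuration

Create an endpoint configuration that defines the infrastructure without specifying a model. The instance type and count are managed at the endpoint level.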


# Create Endpoint Configuration for inference components
INFERENCE_COMPONENT_NAME = MODEL_NAME + "-IC"

try:
    config_response = sagemaker.create_endpoint_config(
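        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        # Inference component endpoints have no model object to carry the role,
        # so the execution role from Step 2 is set on the endpoint configuration.
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,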
        ProductionVariants=[
            {
                'VariantName': 'primary',
                'InstanceType': INSTANCE_TYPE,
                'InitialInstanceCount': 1,
                'RoutingConfig': {
                    'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
                }
            }
        ],
        Tags=[
            {
                'Key': 'sagemaker:nova-inference-component',
                'Value': 'true'
            }
        ]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")

except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")

Create and deploy the endpoint


import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")

# Wait for endpoint to be InService
print("Waiting for endpoint to be InService...")
print("This typically takes 5-10 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print(f"\nEndpoint '{ENDPOINT_NAME}' is ready.")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            break
        else:
            print(f"Status: {status}")
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)

Create the inference component

After the endpoint is InService, deploy your Amazon Nova model as an inference component.


try:
    ic_response = sagemaker.create_inference_component(
        InferenceComponentName=INFERENCE_COMPONENT_NAME,
        EndpointName=ENDPOINT_NAME,
        VariantName='primary',
        Specification={
            'Container': {
                'Image': IMAGE,
                'ArtifactUrl': MODEL_S3_LOCATION,
                'Environment': environment
            },
            'ComputeResourceRequirements': {
                'NumberOfCpuCoresRequired': 15,
                'NumberOfAcceleratorDevicesRequired': 4,
                'MinMemoryRequiredInMb': 25000
            }
        },
        RuntimeConfig={
            'CopyCount': 1
        }
    )
    print("Inference component creation initiated!")
    print(f"Inference Component ARN: {ic_response['InferenceComponentArn']}")

except sagemaker.exceptions.ClientError as e:
    print(f"Error creating inference component: {e}")

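Key parameters:

InferenceComponentName: A unique identifier for the inference component
EndpointName: The endpoint on which to deploy the component
Image: The Docker container image URI for Amazon Nova inference
ArtifactUrl: The Amazon S3 location of the model artifacts
Environment: The environment variables configured in Step 3
NumberOfCpuCoresRequired: The number of CPU cores required per model copy
NumberOfAcceleratorDevicesRequired: The number of accelerator devices (GPUs) required per model copy
MinMemoryRequiredInMb: The minimum memory (in MB) required per model copy
CopyCount: The number of model copies to deploy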
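
The per-copy requirements determine how many copies fit on one instance, which in turn bounds a sensible CopyCount. The arithmetic can be sketched as follows. The function name max_copies_per_instance is hypothetical, and the instance figures in the example (48 vCPUs, 4 GPUs, 192 GiB of memory for ml.g5.12xlarge) are assumptions taken from published instance specifications. The resources that SageMaker actually makes available to inference components may be lower, so treat the result as an upper bound.

```python
def max_copies_per_instance(instance, per_copy):
    """Return how many model copies fit, limited by the scarcest resource."""
    return min(instance[key] // per_copy[key] for key in per_copy)


# Assumed capacity of one ml.g5.12xlarge instance.
g5_12xlarge = {"cpu_cores": 48, "accelerators": 4, "memory_mb": 192 * 1024}

# Per-copy requirements from the create_inference_component call above.
per_copy = {"cpu_cores": 15, "accelerators": 4, "memory_mb": 25000}

print(max_copies_per_instance(g5_12xlarge, per_copy))
```

Under these assumptions the accelerator requirement is the limiting factor: each copy claims all four GPUs, so one instance holds a single copy, and raising CopyCount would also require more instances.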

Monitor the inference component deployment


# Wait for inference component to be InService
print("Waiting for inference component deployment...")
print("This typically takes 10-20 minutes as the model is loaded...\n")

while True:
    try:
        ic_desc = sagemaker.describe_inference_component(
            InferenceComponentName=INFERENCE_COMPONENT_NAME
        )
        ic_status = ic_desc['InferenceComponentStatus']
        
        if ic_status == 'Creating':
            print(f"⏳ Status: {ic_status} - Loading model artifacts...")
        elif ic_status == 'InService':
            print(f"✅ Status: {ic_status}")
            print(f"\nInference component '{INFERENCE_COMPONENT_NAME}' is ready!")
            break
        elif ic_status == 'Failed':
            print(f"❌ Status: {ic_status}")
            print(f"Failure Reason: {ic_desc.get('FailureReason', 'Unknown')}")
            break
        else:
            print(f"Status: {ic_status}")
    except Exception as e:
        print(f"Error checking inference component status: {e}")
        break
    
    time.sleep(30)
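
The polling loops in this step run until the resource reaches InService or Failed, with no upper bound on the wait. If you prefer a bounded wait, one possible shape is sketched below. The helper name wait_for_status and its parameters are hypothetical. The status lookup is passed in as a function, so the same helper can wrap describe_endpoint or describe_inference_component.

```python
import time


def wait_for_status(get_status, success="InService", failure="Failed",
                    timeout_seconds=3600, interval_seconds=30,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until success, failure, or the timeout is reached."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == success:
            return status
        if status == failure:
            raise RuntimeError(f"Resource entered status {failure!r}")
        if clock() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)
```

For the endpoint, get_status could be lambda: sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)['EndpointStatus']. For the inference component, it would read the 'InferenceComponentStatus' field instead.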

Note

When you invoke the endpoint in Step 5, you must include the InferenceComponentName parameter in the invocation call. For details, see "Step 5".

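Option B: Create using a single model endpoint

With a single model endpoint, you create a SageMaker model object and an endpoint configuration, and then deploy the endpoint. This approach packages the model directly into the endpoint configuration.

Create the SageMaker model

The following code creates a SageMaker model that references your Amazon Nova model artifacts.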


try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")

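Key parameters:

ModelName: A unique identifier for the model
Image: The Docker container image URI for Amazon Nova inference
ModelDataSource: The Amazon S3 location of the model artifacts
Environment: The environment variables configured in Step 3
ExecutionRoleArn: The IAM role from Step 2
EnableNetworkIsolation: Set to True for enhanced security (prevents the container from making outbound network calls)

Create the endpoint configuration

Next, create an endpoint configuration that defines the deployment infrastructure.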


# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")

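Key parameters:

VariantName: An identifier for this model variant (use 'primary' for single-model deployments)
ModelName: References the model created above
InitialInstanceCount: The number of instances to deploy (start with 1 and scale later as needed)
InstanceType: The ML instance type selected in Step 3

Deploy the endpoint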


import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")

Monitor endpoint creation

The following code polls the endpoint status until the deployment completes.


# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds

Verify resource creation

You can verify that the resources were created successfully.


# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")

Verify that the endpoint is ready

Regardless of the approach you chose, you can validate the endpoint configuration.


# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break

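Troubleshooting endpoint creation failures

Common failure reasons:

Insufficient capacity: The requested instance type is not available in your Region
- Solution: Try a different instance type or request a quota increase
IAM permissions: The execution role lacks the required permissions
- Solution: Verify that the role can access the Amazon S3 model artifacts and has the required SageMaker permissions
Model artifacts not found: The Amazon S3 URI is incorrect or inaccessible
- Solution: Validate the Amazon S3 URI, check the bucket permissions, and confirm that you are in the correct Region
Resource limits: The account limits for endpoints or instances were exceeded
- Solution: Request a service quota increase through Service Quotas or AWS Support

Note

If you need to delete a failed endpoint and start over: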


sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)

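Step 5: Invoke the endpoint

After the endpoint is InService, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports synchronous endpoints (real-time, in streaming or non-streaming mode) and asynchronous endpoints (Amazon S3-based, for batch processing).

Set up the runtime client

Create a SageMaker Runtime client with appropriate timeout settings.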


import json
import boto3
import botocore
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = botocore.config.Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)

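Create a universal inference function

The following function handles both streaming and non-streaming requests. It uses the INFERENCE_COMPONENT_NAME variable defined in Step 4. If you deployed with inference components (Option A), the variable is already set to MODEL_NAME + "-IC". If you deployed with a single model endpoint (Option B), the variable is not defined, so set it to None before you run this step.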


# Only needed if you followed Option B (single model endpoints) in Step 4:
# INFERENCE_COMPONENT_NAME = None

def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    Supports both inference component and single model endpoint deployments.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    # Build invoke parameters
    invoke_params = {
        'EndpointName': ENDPOINT_NAME,
        'ContentType': 'application/json',
        'Body': body
    }
    
    # Add InferenceComponentName if using inference components
    if INFERENCE_COMPONENT_NAME:
        invoke_params['InferenceComponentName'] = INFERENCE_COMPONENT_NAME
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(**invoke_params)
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            invoke_params['Accept'] = 'application/json'
            response = runtime_client.invoke_endpoint(**invoke_params)
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")

Example 1: Non-streaming chat completion

Use the chat format for conversational interactions.


# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)

Example response:


{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}
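
In most cases you only need a few fields from this response. The following sketch pulls them out, assuming the response has the shape shown in the example above. The helper name summarize_chat_response is hypothetical. It also guards against None, which invoke_nova_endpoint returns when the call fails.

```python
def summarize_chat_response(response):
    """Extract the assistant text, finish reason, and token usage."""
    if response is None:
        raise ValueError("No response: the invocation failed or was a streaming call")
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }
```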

Example 2: Simple text completion

Use the completion format for simple text generation.


# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)

Example response:


{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
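
Log probabilities are natural logarithms, so exponentiating them gives ordinary probabilities, which are often easier to read. The sketch below converts the top_logprobs entries of a completion response, assuming the response shape shown above. The helper name top_token_probabilities is hypothetical.

```python
import math


def top_token_probabilities(response):
    """Pair each generated token with its alternatives as probabilities."""
    logprobs = response["choices"][0]["logprobs"]
    rows = []
    for token, alternatives in zip(logprobs["tokens"], logprobs["top_logprobs"]):
        probabilities = {alt: math.exp(lp) for alt, lp in alternatives.items()}
        rows.append((token, probabilities))
    return rows
```

Applied to the example response, the entry for " Paris" has log probability -0.001, which corresponds to a probability of about 0.999.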

Example 3: Streaming chat completion


# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)

Example streaming output:


Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once"},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon"},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a"},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time"},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
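
Printing raw chunks is useful for debugging, but an application usually wants the generated text. The sketch below reassembles it, assuming the stream consists of newline-terminated "data: ..." lines with a final "data: [DONE]" marker, as in the output above. The function names (iter_sse_events, collect_text) are hypothetical. Incoming bytes are buffered because a payload part is not guaranteed to end on a line boundary.

```python
import json


def _payload(line):
    """Return the text after 'data:' for a data line, otherwise None."""
    text = line.decode("utf-8").strip()
    if text.startswith("data:"):
        return text[len("data:"):].strip()
    return None


def iter_sse_events(byte_chunks):
    """Yield parsed JSON events from an iterable of byte chunks."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the rest.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            payload = _payload(line)
            if payload == "[DONE]":
                return
            if payload:
                yield json.loads(payload)
    # Handle a final line that arrived without a trailing newline.
    payload = _payload(buffer)
    if payload and payload != "[DONE]":
        yield json.loads(payload)


def collect_text(events):
    """Concatenate the delta content of all streamed chat chunks."""
    parts = []
    for event in events:
        for choice in event.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)
```

Inside invoke_nova_endpoint, the byte chunks would be the chunk['Bytes'] values of the PayloadPart events.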

Example 4: Multimodal chat completion

Use the multimodal format for image and text input.


# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)
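
The url value in the request above is elided. One way to produce such a value from raw image bytes is sketched below, using base64 from the standard library. The helper names (image_to_data_url, build_image_message) are hypothetical, and the image/jpeg default is an assumption: set the MIME type to match your image.

```python
import base64


def image_to_data_url(image_bytes, mime_type="image/jpeg"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(question, image_bytes, mime_type="image/jpeg"):
    """Build a user message that combines a text question with one image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url(image_bytes, mime_type)}},
        ],
    }
```

Base64 encoding increases the payload size by roughly a third, so large images may need to be resized before they are sent.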

Example response:


{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}

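Step 6: Clean up resources (optional)

To avoid unnecessary charges, delete the AWS resources that you created in this tutorial. SageMaker endpoints incur charges while they are running, even if you are not actively making inference requests.

Important

Deleting resources is permanent and cannot be undone. Before you continue, confirm that you no longer need these resources.

Initialize the cleanup client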


import boto3
import time

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

Delete the inference component (if you used Option A)

If you deployed using inference components, delete the inference component first, before you delete the endpoint.


# Delete inference component (Option A only)
try:
    print("Deleting inference component...")
    sagemaker.delete_inference_component(InferenceComponentName=INFERENCE_COMPONENT_NAME)
    print(f"✅ Inference component '{INFERENCE_COMPONENT_NAME}' deletion initiated")
except Exception as e:
    print(f"❌ Error deleting inference component: {e}")

# Wait for inference component to be deleted before proceeding
print("Waiting for inference component deletion...")
while True:
    try:
        sagemaker.describe_inference_component(InferenceComponentName=INFERENCE_COMPONENT_NAME)
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Inference component successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break

Delete the endpoint


try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")

Note

Endpoint deletion runs asynchronously. You can monitor the deletion status.


import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break

Delete the endpoint configuration

After the endpoint is deleted, delete the endpoint configuration.


try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")

Delete the model (Option B only)

If you used a single model endpoint, delete the SageMaker model object.


try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
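
The cleanup order depends on which deployment option you followed. As a quick reference, the sketch below returns the Boto3 delete calls in the order used in this step. The function name cleanup_plan is hypothetical. It only lists the calls: the waits between them, such as waiting for the inference component to disappear before deleting the endpoint, still apply.

```python
def cleanup_plan(option):
    """Return the SageMaker delete calls, in order, for option 'A' or 'B'."""
    if option == "A":
        return ["delete_inference_component", "delete_endpoint", "delete_endpoint_config"]
    if option == "B":
        return ["delete_endpoint", "delete_endpoint_config", "delete_model"]
    raise ValueError(f"Unknown deployment option: {option!r}")
```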
