

# SageMaker Inference

Custom Amazon Nova models are now available on SageMaker inference. With Amazon Nova on SageMaker, you can start getting predictions, or inferences, from your trained custom Amazon Nova models. SageMaker provides a broad selection of ML infrastructure and model deployment options to help meet all your ML inference needs. With SageMaker inference, you can scale your model deployment, manage models more effectively in production, and reduce operational burden.

SageMaker provides various inference options, such as real-time endpoints for low-latency inference and asynchronous endpoints for batches of requests. By choosing the appropriate inference option for your use case, you can ensure efficient model deployment and inference. For more information on SageMaker inference, see [Deploy models for inference](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html).

**Important**  
Only full-rank custom models and LoRA-merged models are supported on SageMaker inference. For unmerged LoRA models and base models, use Amazon Bedrock.

## Features


The following features are available for Amazon Nova models on SageMaker inference:

**Model Capabilities**
+ Text generation

**Deployment and Scaling**
+ Real-time endpoints with custom instance selection
+ Auto Scaling – Automatically adjust capacity based on traffic patterns to optimize costs and GPU utilization. For more information, see [Automatically Scale Amazon SageMaker Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).
+ Streaming API support for real-time token generation

**Monitoring and Optimization**
+ Amazon CloudWatch integration for monitoring and alerts
+ Availability Zone-aware latency optimization through VPC configuration

**Development Tools**
+ AWS CLI support – For more information, see [AWS CLI Command Reference for SageMaker](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/).
+ Notebook integration via SDK support

## Supported models and instances


When creating your SageMaker inference endpoints, you can set two environment variables to configure your deployment: `CONTEXT_LENGTH` and `MAX_CONCURRENCY`.
+ `CONTEXT_LENGTH` – Maximum total token length (input + output) per request
+ `MAX_CONCURRENCY` – Maximum number of concurrent requests the endpoint will serve

The following table lists the supported Amazon Nova models, instance types, and supported configurations. The `MAX_CONCURRENCY` values represent the maximum supported concurrency for each `CONTEXT_LENGTH` setting:


[See the AWS documentation website for more details](http://docs.aws.amazon.com/nova/latest/nova2-userguide/nova-model-sagemaker-inference.html)

**Note**  
For instances where FP8 quantization is required, it will be enabled by default.  
The `MAX_CONCURRENCY` values shown are upper bounds for each `CONTEXT_LENGTH` setting. You can use lower context lengths with the same concurrency, but exceeding these values will cause SageMaker endpoint creation to fail.  
For example, on Amazon Nova Micro with a ml.g5.12xlarge:  
+ `CONTEXT_LENGTH=2000`, `MAX_CONCURRENCY=12` → Valid
+ `CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=12` → Rejected (concurrency limit is 6 at context length 8000)
+ `CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=4` → Valid
+ `CONTEXT_LENGTH=8000`, `MAX_CONCURRENCY=6` → Valid
+ `CONTEXT_LENGTH=10000` → Rejected (maximum context length is 8000 on this instance)
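
The validation rules above can be sketched as a small pre-deployment check. The limits below are only the Amazon Nova Micro on `ml.g5.12xlarge` values from the example; substitute the limits for your model and instance type from the table:

```
# Illustrative limits for Nova Micro on ml.g5.12xlarge, taken from the example
# above: maximum concurrency per context-length tier. A request is checked
# against the smallest tier that can hold its context length.
LIMITS = {2000: 12, 8000: 6}  # context_length -> max concurrency

def validate_config(context_length, max_concurrency, limits=LIMITS):
    """Return True if this CONTEXT_LENGTH / MAX_CONCURRENCY pair is within bounds."""
    eligible = [cl for cl in sorted(limits) if context_length <= cl]
    if not eligible:
        return False  # context length exceeds the instance maximum
    return max_concurrency <= limits[eligible[0]]

print(validate_config(8000, 12))  # False: concurrency limit is 6 at this context length
```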

## Supported AWS Regions


The following table lists the AWS Regions where Amazon Nova models are available on SageMaker inference:



| Region Name | Region Code | Availability | 
| --- | --- | --- | 
| US East (N. Virginia) | us-east-1 | Available | 
| US West (Oregon) | us-west-2 | Available | 

## Supported Container Images


The following table lists the container image URIs for Amazon Nova models on SageMaker inference by region.



| Region | Container Image URIs | 
| --- | --- | 
| us-east-1 | 708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-inference-repo:SM-Inference-latest | 
| us-west-2 | 176779409107.dkr.ecr.us-west-2.amazonaws.com/nova-inference-repo:SM-Inference-latest | 

## Best Practices


For best practices on deploying and managing models on SageMaker, see [Best Practices for SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/best-practices.html).

## Support


For issues and support with Amazon Nova models on SageMaker inference, contact AWS Support through the Console or your AWS account manager.

**Topics**
+ [Features](#nova-sagemaker-inference-features)
+ [Supported models and instances](#nova-sagemaker-inference-supported)
+ [Supported AWS Regions](#nova-sagemaker-inference-regions)
+ [Supported Container Images](#nova-sagemaker-inference-container-images)
+ [Best Practices](#nova-sagemaker-inference-best-practices)
+ [Support](#nova-sagemaker-inference-support)
+ [Getting Started](nova-sagemaker-inference-getting-started.md)
+ [API Reference](nova-sagemaker-inference-api-reference.md)
+ [Evaluate Models Hosted on SageMaker Inference](nova-eval-on-sagemaker-inference.md)
+ [Deployment of Amazon Nova Forge Models in Amazon SageMaker Inference abuse detection](nova-sagemaker-inference-abuse-detection.md)

# Getting Started

This guide shows you how to deploy customized Amazon Nova models on SageMaker real-time endpoints, configure inference parameters, and invoke your models for testing.

## Prerequisites


The following are prerequisites to deploy Amazon Nova models on SageMaker inference:
+ Create an AWS account - If you don't have one already, see [Creating an AWS account](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html#sign-up-for-aws).
+ Required IAM permissions - Ensure your IAM user or role has the following managed policies attached:
  + `AmazonSageMakerFullAccess`
  + `AmazonS3FullAccess`
+ Required SDKs/CLI versions - The following SDK versions have been tested and validated with Amazon Nova models on SageMaker inference:
  + SageMaker Python SDK v3.0.0 or later (`sagemaker>=3.0.0`) for the resource-based API approach
  + Boto3 version 1.35.0 or later (`boto3>=1.35.0`) for direct API calls. The examples in this guide use this approach.
+ Service quota increase - Request an Amazon SageMaker service quota increase for the ML instance type you plan to use for your SageMaker Inference endpoint (for example, `ml.p5.48xlarge` for endpoint usage). For a list of supported instance types, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported). To request an increase, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html). For information about SageMaker instance quotas, see [SageMaker endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html).

## Step 1: Configure AWS credentials


Configure your AWS credentials using one of the following methods:

**Option 1: AWS CLI (Recommended)**

```
aws configure
```

Enter your AWS access key, secret key, and default region when prompted.

**Option 2: AWS credentials file**

Create or edit `~/.aws/credentials`:

```
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

**Option 3: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
```

**Note**  
For more information about AWS credentials, see [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).

**Initialize AWS clients**

Create a Python script or notebook with the following code to initialize the AWS SDK and verify your credentials:

```
import boto3

# AWS Configuration - Update these for your environment
REGION = "us-east-1"  # Supported regions: us-east-1, us-west-2
AWS_ACCOUNT_ID = "YOUR_ACCOUNT_ID"  # Replace with your AWS account ID

# Initialize AWS clients using default credential chain
sagemaker = boto3.client('sagemaker', region_name=REGION)
sts = boto3.client('sts')

# Verify credentials
try:
    identity = sts.get_caller_identity()
    print(f"Successfully authenticated to AWS Account: {identity['Account']}")
    
    if identity['Account'] != AWS_ACCOUNT_ID:
        print(f"Warning: Connected to account {identity['Account']}, expected {AWS_ACCOUNT_ID}")

except Exception as e:
    print(f"Failed to authenticate: {e}")
    print("Please verify your credentials are configured correctly.")
```

If the authentication is successful, you should see output confirming your AWS account ID.

## Step 2: Create a SageMaker execution role


A SageMaker execution role is an IAM role that grants SageMaker permissions to access AWS resources on your behalf, such as Amazon S3 buckets for model artifacts and CloudWatch for logging.

**Creating the execution role**

**Note**  
Creating IAM roles requires `iam:CreateRole` and `iam:AttachRolePolicy` permissions. Ensure your IAM user or role has these permissions before proceeding.

The following code creates an IAM role with the necessary permissions for deploying Amazon Nova customized models:

```
import json

# Create SageMaker Execution Role
role_name = f"SageMakerInference-ExecutionRole-{AWS_ACCOUNT_ID}"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole"
        }
    ]
}

iam = boto3.client('iam', region_name=REGION)

# Create the role (reuse it if it already exists)
try:
    role_response = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy),
        Description='SageMaker execution role with S3 and SageMaker access'
    )
except iam.exceptions.EntityAlreadyExistsException:
    role_response = iam.get_role(RoleName=role_name)

# Attach required policies
iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)

iam.attach_role_policy(
    RoleName=role_name,
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess'
)

SAGEMAKER_EXECUTION_ROLE_ARN = role_response['Role']['Arn']
print(f"Created SageMaker execution role: {SAGEMAKER_EXECUTION_ROLE_ARN}")
```

**Using an existing execution role (Optional)**

If you already have a SageMaker execution role, you can use it instead:

```
# Replace with your existing role ARN
SAGEMAKER_EXECUTION_ROLE_ARN = "arn:aws:iam::YOUR_ACCOUNT_ID:role/YOUR_EXISTING_ROLE_NAME"
```

To find existing SageMaker roles in your account:

```
iam = boto3.client('iam', region_name=REGION)
sagemaker_roles = []
# list_roles returns at most 100 roles per page, so paginate through all of them
for page in iam.get_paginator('list_roles').paginate():
    sagemaker_roles.extend(role for role in page['Roles'] if 'SageMaker' in role['RoleName'])
for role in sagemaker_roles:
    print(f"{role['RoleName']}: {role['Arn']}")
```

**Important**  
The execution role must have a trust relationship with `sagemaker.amazonaws.com` and permissions to access Amazon S3 and SageMaker resources.

For more information about SageMaker execution roles, see [SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

## Step 3: Configure model parameters


Configure the deployment parameters for your Amazon Nova model. These settings control model behavior, resource allocation, and inference characteristics. For a list of supported instance types and supported `CONTEXT_LENGTH` and `MAX_CONCURRENCY` values for each, see [Supported models and instances](nova-model-sagemaker-inference.md#nova-sagemaker-inference-supported).

**Required parameters**
+ `IMAGE`: The Docker container image URI for the Amazon Nova inference container. This will be provided by AWS.
+ `CONTEXT_LENGTH`: Maximum total token length (input plus output) per request.
+ `MAX_CONCURRENCY`: Maximum number of sequences per iteration. This sets the limit on how many individual user requests (prompts) can be processed concurrently within a single batch on the GPU. Range: integer greater than 0.

**Optional generation parameters**
+ `DEFAULT_TEMPERATURE`: Controls randomness in generation. Range: 0.0 to 2.0 (0.0 = deterministic, higher = more random).
+ `DEFAULT_TOP_P`: Nucleus sampling threshold for token selection. Range: 1e-10 to 1.0.
+ `DEFAULT_TOP_K`: Limits token selection to top K most likely tokens. Range: integer -1 or greater (-1 = no limit).
+ `DEFAULT_MAX_NEW_TOKENS`: Maximum number of tokens to generate in the response (that is, the maximum output tokens). Range: integer 1 or greater.
+ `DEFAULT_LOGPROBS`: Number of log probabilities to return per token. Range: integer 1 to 20.

**Configure your deployment**

```
# AWS Configuration
REGION = "us-east-1"  # Must match region from Step 1

# ECR Account mapping by region
ECR_ACCOUNT_MAP = {
    "us-east-1": "708977205387",
    "us-west-2": "176779409107"
}

# Container Image
IMAGE = f"{ECR_ACCOUNT_MAP[REGION]}.dkr.ecr.{REGION}.amazonaws.com/nova-inference-repo:SM-Inference-latest"
print(f"IMAGE = {IMAGE}")

# Model Parameters
CONTEXT_LENGTH = "16000"       # Maximum total context length
MAX_CONCURRENCY = "2"          # Maximum concurrent sequences

# Optional: Default generation parameters (uncomment to use)
DEFAULT_TEMPERATURE = "0.0"   # Deterministic output
DEFAULT_TOP_P = "1.0"         # Consider all tokens
# DEFAULT_TOP_K = "50"        # Uncomment to limit to top 50 tokens
# DEFAULT_MAX_NEW_TOKENS = "2048"  # Uncomment to set max output tokens
# DEFAULT_LOGPROBS = "1"      # Uncomment to enable log probabilities

# Build environment variables for the container
environment = {
    'CONTEXT_LENGTH': CONTEXT_LENGTH,
    'MAX_CONCURRENCY': MAX_CONCURRENCY,
}

# Add optional parameters if defined
if 'DEFAULT_TEMPERATURE' in globals():
    environment['DEFAULT_TEMPERATURE'] = DEFAULT_TEMPERATURE
if 'DEFAULT_TOP_P' in globals():
    environment['DEFAULT_TOP_P'] = DEFAULT_TOP_P
if 'DEFAULT_TOP_K' in globals():
    environment['DEFAULT_TOP_K'] = DEFAULT_TOP_K
if 'DEFAULT_MAX_NEW_TOKENS' in globals():
    environment['DEFAULT_MAX_NEW_TOKENS'] = DEFAULT_MAX_NEW_TOKENS
if 'DEFAULT_LOGPROBS' in globals():
    environment['DEFAULT_LOGPROBS'] = DEFAULT_LOGPROBS

print("Environment configuration:")
for key, value in environment.items():
    print(f"  {key}: {value}")
```

**Configure deployment-specific parameters**

Now configure the specific parameters for your Amazon Nova model deployment, including model artifacts location and instance type selection.

**Set deployment identifier**

```
# Deployment identifier - use a descriptive name for your use case
JOB_NAME = "my-nova-deployment"
```

**Specify model artifacts location**

Provide the Amazon S3 URI where your trained Amazon Nova model artifacts are stored. This should be the output location from your model training or fine-tuning job.

```
# S3 location of your trained Nova model artifacts
# Replace with your model's S3 URI - must end with /
MODEL_S3_LOCATION = "s3://your-bucket-name/path/to/model/artifacts/"
```

**Select model variant and instance type**

```
# Configure model variant and instance type
TESTCASE = {
    "model": "lite2",              # Options: micro, lite, lite2
    "instance": "ml.p5.48xlarge"   # Refer to "Supported models and instances" section
}

# Generate resource names
INSTANCE_TYPE = TESTCASE["instance"]
MODEL_NAME = JOB_NAME + "-" + TESTCASE["model"] + "-" + INSTANCE_TYPE.replace(".", "-")
ENDPOINT_CONFIG_NAME = MODEL_NAME + "-Config"
ENDPOINT_NAME = MODEL_NAME + "-Endpoint"

print(f"Model Name: {MODEL_NAME}")
print(f"Endpoint Config: {ENDPOINT_CONFIG_NAME}")
print(f"Endpoint Name: {ENDPOINT_NAME}")
```

**Naming conventions**

The code automatically generates consistent names for AWS resources:
+ Model Name: `{JOB_NAME}-{model}-{instance-type}`
+ Endpoint Config: `{MODEL_NAME}-Config`
+ Endpoint Name: `{MODEL_NAME}-Endpoint`

## Step 4: Create SageMaker model and endpoint configuration


In this step, you'll create two essential resources: a SageMaker model object that references your Amazon Nova model artifacts, and an endpoint configuration that defines how the model will be deployed.

**SageMaker Model**: A model object that packages the inference container image, model artifacts location, and environment configuration. This is a reusable resource that can be deployed to multiple endpoints.

**Endpoint Configuration**: Defines the infrastructure settings for deployment, including instance type, instance count, and model variants. This allows you to manage deployment settings separately from the model itself.

**Create the SageMaker model**

The following code creates a SageMaker model that references your Amazon Nova model artifacts:

```
try:
    model_response = sagemaker.create_model(
        ModelName=MODEL_NAME,
        PrimaryContainer={
            'Image': IMAGE,
            'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': MODEL_S3_LOCATION,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None'
                }
            },
            'Environment': environment
        },
        ExecutionRoleArn=SAGEMAKER_EXECUTION_ROLE_ARN,
        EnableNetworkIsolation=True
    )
    print("Model created successfully!")
    print(f"Model ARN: {model_response['ModelArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating model: {e}")
```

Key parameters:
+ `ModelName`: Unique identifier for your model
+ `Image`: Docker container image URI for Amazon Nova inference
+ `ModelDataSource`: Amazon S3 location of your model artifacts
+ `Environment`: Environment variables configured in Step 3
+ `ExecutionRoleArn`: IAM role from Step 2
+ `EnableNetworkIsolation`: Set to True for enhanced security (prevents container from making outbound network calls)

**Create the endpoint configuration**

Next, create an endpoint configuration that defines your deployment infrastructure:

```
# Create Endpoint Configuration
try:
    production_variant = {
        'VariantName': 'primary',
        'ModelName': MODEL_NAME,
        'InitialInstanceCount': 1,
        'InstanceType': INSTANCE_TYPE,
    }
    
    config_response = sagemaker.create_endpoint_config(
        EndpointConfigName=ENDPOINT_CONFIG_NAME,
        ProductionVariants=[production_variant]
    )
    print("Endpoint configuration created successfully!")
    print(f"Config ARN: {config_response['EndpointConfigArn']}")
    
except sagemaker.exceptions.ClientError as e:
    print(f"Error creating endpoint configuration: {e}")
```

Key parameters:
+ `VariantName`: Identifier for this model variant (use 'primary' for single-model deployments)
+ `ModelName`: References the model created above
+ `InitialInstanceCount`: Number of instances to deploy (start with 1, scale later if needed)
+ `InstanceType`: ML instance type selected in Step 3

**Verify resource creation**

You can verify that your resources were created successfully:

```
# Describe the model
model_info = sagemaker.describe_model(ModelName=MODEL_NAME)
print(f"Model Status: {model_info['ModelName']} created")

# Describe the endpoint configuration
config_info = sagemaker.describe_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
print(f"Endpoint Config Status: {config_info['EndpointConfigName']} created")
```

## Step 5: Deploy the endpoint


The next step is to deploy your Amazon Nova model by creating a SageMaker real-time endpoint. This endpoint will host your model and provide a secure HTTPS endpoint for making inference requests.

Endpoint creation typically takes 15-30 minutes as AWS provisions the infrastructure, downloads your model artifacts, and initializes the inference container.

**Create the endpoint**

```
import time

try:
    endpoint_response = sagemaker.create_endpoint(
        EndpointName=ENDPOINT_NAME,
        EndpointConfigName=ENDPOINT_CONFIG_NAME
    )
    print("Endpoint creation initiated successfully!")
    print(f"Endpoint ARN: {endpoint_response['EndpointArn']}")
except Exception as e:
    print(f"Error creating endpoint: {e}")
```

**Monitor endpoint creation**

The following code polls the endpoint status until deployment is complete:

```
# Monitor endpoint creation progress
print("Waiting for endpoint creation to complete...")
print("This typically takes 15-30 minutes...\n")

while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        
        if status == 'Creating':
            print(f"⏳ Status: {status} - Provisioning infrastructure and loading model...")
        elif status == 'InService':
            print(f"✅ Status: {status}")
            print("\nEndpoint creation completed successfully!")
            print(f"Endpoint Name: {ENDPOINT_NAME}")
            print(f"Endpoint ARN: {response['EndpointArn']}")
            break
        elif status == 'Failed':
            print(f"❌ Status: {status}")
            print(f"Failure Reason: {response.get('FailureReason', 'Unknown')}")
            print("\nFull response:")
            print(response)
            break
        else:
            print(f"Status: {status}")
        
    except Exception as e:
        print(f"Error checking endpoint status: {e}")
        break
    
    time.sleep(30)  # Check every 30 seconds
```

**Verify endpoint is ready**

Once the endpoint is InService, you can verify its configuration:

```
# Get detailed endpoint information
endpoint_info = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)

print("\n=== Endpoint Details ===")
print(f"Endpoint Name: {endpoint_info['EndpointName']}")
print(f"Endpoint ARN: {endpoint_info['EndpointArn']}")
print(f"Status: {endpoint_info['EndpointStatus']}")
print(f"Creation Time: {endpoint_info['CreationTime']}")
print(f"Last Modified: {endpoint_info['LastModifiedTime']}")

# Get endpoint config for instance type details
endpoint_config_name = endpoint_info['EndpointConfigName']
endpoint_config = sagemaker.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Display production variant details
for variant in endpoint_info['ProductionVariants']:
    print(f"\nProduction Variant: {variant['VariantName']}")
    print(f"  Current Instance Count: {variant['CurrentInstanceCount']}")
    print(f"  Desired Instance Count: {variant['DesiredInstanceCount']}")
    # Get instance type from endpoint config
    for config_variant in endpoint_config['ProductionVariants']:
        if config_variant['VariantName'] == variant['VariantName']:
            print(f"  Instance Type: {config_variant['InstanceType']}")
            break
```

**Troubleshooting endpoint creation failures**

Common failure reasons:
+ **Insufficient capacity**: The requested instance type is not available in your region
  + Solution: Try a different instance type or request a quota increase
+ **IAM permissions**: The execution role lacks necessary permissions
  + Solution: Verify the role has access to Amazon S3 model artifacts and necessary SageMaker permissions
+ **Model artifacts not found**: The Amazon S3 URI is incorrect or inaccessible
  + Solution: Verify the Amazon S3 URI and check bucket permissions, make sure you're in the correct region
+ **Resource limits**: Account limits exceeded for endpoints or instances
  + Solution: Request a service quota increase through Service Quotas or AWS Support

**Note**  
If you need to delete a failed endpoint and start over:  

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

## Step 6: Invoke the endpoint


Once your endpoint is InService, you can send inference requests to generate predictions from your Amazon Nova model. SageMaker supports synchronous endpoints (real-time with streaming/non-streaming modes) and asynchronous endpoints (Amazon S3-based for batch processing).

**Set up the runtime client**

Create a SageMaker Runtime client with appropriate timeout settings:

```
import json
import boto3
import botocore
from botocore.exceptions import ClientError

# Configure client with appropriate timeouts
config = botocore.config.Config(
    read_timeout=120,      # Maximum time to wait for response
    connect_timeout=10,    # Maximum time to establish connection
    retries={'max_attempts': 3}  # Number of retry attempts
)

# Create SageMaker Runtime client
runtime_client = boto3.client('sagemaker-runtime', config=config, region_name=REGION)
```

**Create a universal inference function**

The following function handles both streaming and non-streaming requests:

```
def invoke_nova_endpoint(request_body):
    """
    Invoke Nova endpoint with automatic streaming detection.
    
    Args:
        request_body (dict): Request payload containing prompt and parameters
    
    Returns:
        dict: Response from the model (for non-streaming requests)
        None: For streaming requests (prints output directly)
    """
    body = json.dumps(request_body)
    is_streaming = request_body.get("stream", False)
    
    try:
        print(f"Invoking endpoint ({'streaming' if is_streaming else 'non-streaming'})...")
        
        if is_streaming:
            response = runtime_client.invoke_endpoint_with_response_stream(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Body=body
            )
            
            event_stream = response['Body']
            for event in event_stream:
                if 'PayloadPart' in event:
                    chunk = event['PayloadPart']
                    if 'Bytes' in chunk:
                        data = chunk['Bytes'].decode()
                        print("Chunk:", data)
        else:
            # Non-streaming inference
            response = runtime_client.invoke_endpoint(
                EndpointName=ENDPOINT_NAME,
                ContentType='application/json',
                Accept='application/json',
                Body=body
            )
            
            response_body = response['Body'].read().decode('utf-8')
            result = json.loads(response_body)
            print("✅ Response received successfully")
            return result
    
    except ClientError as e:
        error_code = e.response['Error']['Code']
        error_message = e.response['Error']['Message']
        print(f"❌ AWS Error: {error_code} - {error_message}")
    except Exception as e:
        print(f"❌ Unexpected error: {str(e)}")
```

**Example 1: Non-streaming chat completion**

Use the chat format for conversational interactions:

```
# Non-streaming chat request
chat_request = {
    "messages": [
        {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 100,
    "max_completion_tokens": 100,  # Alternative to max_tokens
    "stream": False,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "logprobs": True,
    "top_logprobs": 3,
    "reasoning_effort": "low",  # Options: "low", "high"
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(chat_request)
```

**Sample response:**

```
{
    "id": "chatcmpl-123456",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! I'm doing well, thank you for asking. I'm here and ready to help you with any questions or tasks you might have. How can I assist you today?"
            },
            "logprobs": {
                "content": [
                    {
                        "token": "Hello",
                        "logprob": -0.123,
                        "top_logprobs": [
                            {"token": "Hello", "logprob": -0.123},
                            {"token": "Hi", "logprob": -2.456},
                            {"token": "Hey", "logprob": -3.789}
                        ]
                    }
                    # Additional tokens...
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 12,
        "completion_tokens": 28,
        "total_tokens": 40
    }
}
```

**Example 2: Simple text completion**

Use the completion format for simple text generation:

```
# Simple completion request
completion_request = {
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "stream": False,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,  # -1 means no limit
    "logprobs": 3,  # Number of log probabilities to return
    "allowed_token_ids": None,  # List of allowed token IDs
    "truncate_prompt_tokens": None,  # Truncate prompt to this many tokens
    "stream_options": None
}

response = invoke_nova_endpoint(completion_request)
```

**Sample response:**

```
{
    "id": "cmpl-789012",
    "object": "text_completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "text": " Paris.",
            "index": 0,
            "logprobs": {
                "tokens": [" Paris", "."],
                "token_logprobs": [-0.001, -0.002],
                "top_logprobs": [
                    {" Paris": -0.001, " London": -5.234, " Rome": -6.789},
                    {".": -0.002, ",": -4.567, "!": -7.890}
                ]
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 6,
        "completion_tokens": 2,
        "total_tokens": 8
    }
}
```

**Example 3: Streaming chat completion**

```
# Streaming chat request
streaming_request = {
    "messages": [
        {"role": "user", "content": "Tell me a short story about a robot"}
    ],
    "max_tokens": 200,
    "stream": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "logprobs": True,
    "top_logprobs": 2,
    "reasoning_effort": "high",  # For more detailed reasoning
    "stream_options": {"include_usage": True}
}

invoke_nova_endpoint(streaming_request)
```

**Sample streaming output:**

```
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" Once","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101],"top_logprobs":[{"token":"\u2581Once","logprob":-0.6078429222106934,"bytes":[226,150,129,79,110,99,101]},{"token":"\u2581In","logprob":-0.7864127159118652,"bytes":[226,150,129,73,110]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" upon","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110],"top_logprobs":[{"token":"\u2581upon","logprob":-0.0012345,"bytes":[226,150,129,117,112,111,110]},{"token":"\u2581a","logprob":-6.789,"bytes":[226,150,129,97]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" a","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97],"top_logprobs":[{"token":"\u2581a","logprob":-0.0001234,"bytes":[226,150,129,97]},{"token":"\u2581time","logprob":-9.123,"bytes":[226,150,129,116,105,109,101]}]}]},"finish_reason":null,"token_ids":null}]}
Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{"content":" time","reasoning_content":null},"logprobs":{"content":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101],"top_logprobs":[{"token":"\u2581time","logprob":-0.0023456,"bytes":[226,150,129,116,105,109,101]},{"token":",","logprob":-6.012,"bytes":[44]}]}]},"finish_reason":null,"token_ids":null}]}

# Additional chunks...

Chunk: data: {"id":"chatcmpl-029ca032-fa01-4868-80b7-c4cb1af90fb9","object":"chat.completion.chunk","created":1772060532,"model":"default","choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":15,"completion_tokens":87,"total_tokens":102}}
Chunk: data: [DONE]
```

**Example 4: Multimodal chat completion**

Use the multimodal format to send image and text inputs together:

```
# Multimodal chat request (if supported by your model)
multimodal_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
            ]
        }
    ],
    "max_tokens": 150,
    "temperature": 0.3,
    "top_p": 0.8,
    "stream": False
}

response = invoke_nova_endpoint(multimodal_request)
```

**Sample response:**

```
{
    "id": "chatcmpl-345678",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "default",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The image shows..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 1250,
        "completion_tokens": 45,
        "total_tokens": 1295
    }
}
```

## Step 7: Clean up resources (Optional)


To avoid incurring unnecessary charges, delete the AWS resources you created during this tutorial. SageMaker endpoints incur charges while they're running, even if you're not actively making inference requests.

**Important**  
Deleting resources is permanent and cannot be undone. Ensure you no longer need these resources before proceeding.

**Delete the endpoint**

```
import boto3

# Initialize SageMaker client
sagemaker = boto3.client('sagemaker', region_name=REGION)

try:
    print("Deleting endpoint...")
    sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
    print(f"✅ Endpoint '{ENDPOINT_NAME}' deletion initiated")
    print("Charges will stop once deletion completes (typically 2-5 minutes)")
except Exception as e:
    print(f"❌ Error deleting endpoint: {e}")
```

**Note**  
Endpoint deletion is asynchronous. You can monitor the deletion status:  

```
import time

print("Monitoring endpoint deletion...")
while True:
    try:
        response = sagemaker.describe_endpoint(EndpointName=ENDPOINT_NAME)
        status = response['EndpointStatus']
        print(f"Status: {status}")
        time.sleep(10)
    except sagemaker.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'ValidationException':
            print("✅ Endpoint successfully deleted")
            break
        else:
            print(f"Error: {e}")
            break
```

**Delete the endpoint configuration**

After the endpoint is deleted, remove the endpoint configuration:

```
try:
    print("Deleting endpoint configuration...")
    sagemaker.delete_endpoint_config(EndpointConfigName=ENDPOINT_CONFIG_NAME)
    print(f"✅ Endpoint configuration '{ENDPOINT_CONFIG_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting endpoint configuration: {e}")
```

**Delete the model**

Remove the SageMaker model object:

```
try:
    print("Deleting model...")
    sagemaker.delete_model(ModelName=MODEL_NAME)
    print(f"✅ Model '{MODEL_NAME}' deleted")
except Exception as e:
    print(f"❌ Error deleting model: {e}")
```

# API Reference
API reference

Amazon Nova models on SageMaker use the standard SageMaker Runtime API for inference. For complete API documentation, see [Test a deployed model](https://docs.aws.amazon.com//sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

## Endpoint invocation


Amazon Nova models on SageMaker support two invocation methods:
+ **Synchronous invocation**: Use the [InvokeEndpoint](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API for real-time, non-streaming inference requests.
+ **Streaming invocation**: Use the [InvokeEndpointWithResponseStream](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html) API for real-time streaming inference requests.
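Both methods can be called through the `sagemaker-runtime` client in boto3. The sketch below is illustrative: the endpoint name and region are placeholders, and the request body follows the chat completion format described in the next section. The boto3 import is kept in the usage comment so the helpers themselves are self-contained.

```python
import json

def build_chat_request(prompt, max_tokens=100, stream=False):
    """Build a minimal chat-completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }

def invoke_sync(runtime, endpoint_name, payload):
    """Synchronous inference via the InvokeEndpoint API."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

def invoke_streaming(runtime, endpoint_name, payload):
    """Streaming inference via the InvokeEndpointWithResponseStream API."""
    response = runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    # The response body is an event stream; each event may carry a PayloadPart.
    for event in response["Body"]:
        if "PayloadPart" in event:
            yield event["PayloadPart"]["Bytes"].decode("utf-8")

# Usage (requires boto3 and a deployed endpoint; names are placeholders):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")
# result = invoke_sync(runtime, "my-nova-endpoint", build_chat_request("Hello"))
```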

## Request format


Amazon Nova models support two request formats:

**Chat completion format**

Use this format for conversational interactions:

```
{
  "messages": [
    {"role": "user", "content": "string"}
  ],
  "max_tokens": integer,
  "max_completion_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": boolean,
  "top_logprobs": integer,
  "reasoning_effort": "low" | "high",
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Text completion format**

Use this format for simple text generation:

```
{
  "prompt": "string",
  "max_tokens": integer,
  "stream": boolean,
  "temperature": float,
  "top_p": float,
  "top_k": integer,
  "logprobs": integer,
  "allowed_token_ids": [integer],
  "truncate_prompt_tokens": integer,
  "stream_options": {
    "include_usage": boolean
  }
}
```

**Multimodal chat completion format**

Use this format for image and text inputs:

```
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
      ]
    }
  ],
  "max_tokens": integer,
  "temperature": float,
  "top_p": float,
  "stream": boolean
}
```

**Request parameters**
+ `messages` (array): For chat completion format. Array of message objects with `role` and `content` fields. Content can be a string for text-only or an array for multimodal inputs.
+ `prompt` (string): For text completion format. The input text to generate from.
+ `max_tokens` (integer): Maximum number of tokens to generate in the response. Range: 1 or greater.
+ `max_completion_tokens` (integer): Alternative to `max_tokens` for chat completions. Maximum number of completion tokens to generate.
+ `temperature` (float): Controls randomness in generation. Range: 0.0 to 2.0 (0.0 = deterministic, 2.0 = maximum randomness).
+ `top_p` (float): Nucleus sampling threshold. Range: 1e-10 to 1.0.
+ `top_k` (integer): Limits token selection to top K most likely tokens. Range: -1 or greater (-1 = no limit).
+ `stream` (boolean): Whether to stream the response. Set to `true` for streaming, `false` for non-streaming.
+ `logprobs` (boolean/integer): For chat completions, a boolean. For text completions, an integer specifying the number of log probabilities to return. Range: 1 to 20.
+ `top_logprobs` (integer): Number of most likely tokens to return log probabilities for (chat completions only).
+ `reasoning_effort` (string): Level of reasoning effort. Options: "low", "high" (chat completions for Nova 2 Lite custom models only).
+ `allowed_token_ids` (array): List of token IDs that are allowed to be generated. Restricts output to specified tokens.
+ `truncate_prompt_tokens` (integer): Truncate the prompt to this many tokens if it exceeds the limit.
+ `stream_options` (object): Options for streaming responses. Contains `include_usage` boolean to include token usage in streaming responses.
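To make the documented ranges concrete, the helper below validates a request body against them before invocation. It is an illustrative sketch, not part of the SageMaker API; the checks mirror the parameter list above.

```python
def validate_request(request):
    """Check sampling parameters against the documented ranges (illustrative only)."""
    errors = []
    # max_tokens: 1 or greater
    if "max_tokens" in request and request["max_tokens"] < 1:
        errors.append("max_tokens must be 1 or greater")
    # temperature: 0.0 (deterministic) to 2.0 (maximum randomness)
    temperature = request.get("temperature")
    if temperature is not None and not (0.0 <= temperature <= 2.0):
        errors.append("temperature must be between 0.0 and 2.0")
    # top_p: nucleus sampling threshold, 1e-10 to 1.0
    top_p = request.get("top_p")
    if top_p is not None and not (1e-10 <= top_p <= 1.0):
        errors.append("top_p must be between 1e-10 and 1.0")
    # top_k: -1 means no limit
    top_k = request.get("top_k")
    if top_k is not None and top_k < -1:
        errors.append("top_k must be -1 (no limit) or greater")
    # text-completion logprobs: integer 1 to 20 (chat completions use a boolean)
    logprobs = request.get("logprobs")
    if isinstance(logprobs, int) and not isinstance(logprobs, bool) and not (1 <= logprobs <= 20):
        errors.append("logprobs (text completion) must be between 1 and 20")
    return errors
```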

## Response format


The response format depends on the invocation method and request type:

**Chat completion response (non-streaming)**

For synchronous chat completion requests:

```
{
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I'm doing well, thank you for asking. How can I help you today?",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": [9906, 0, 358, 2157, 1049, 11, 1309, 345, 369, 6464, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "prompt_token_ids": [9906, 0, 358]
}
```

**Text completion response (non-streaming)**

For synchronous text completion requests:

```
{
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris, the capital and most populous city of France.",
      "logprobs": {
        "tokens": ["Paris", ",", " the", " capital"],
        "token_logprobs": [-0.31725305, -0.07918124, -0.12345678, -0.23456789],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403,
            "Rome": -2.1234567
          },
          {
            ",": -0.07918124,
            " is": -1.2345678
          }
        ]
      },
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_token_ids": [464, 6864, 315, 4881, 374],
      "token_ids": [3915, 11, 279, 6864, 323, 1455, 95551, 3363, 315, 4881, 13]
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}
```

**Chat completion streaming response**

For streaming chat completion requests, responses are sent as Server-Sent Events (SSE):

```
data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Hello",
        "refusal": null,
        "reasoning": null,
        "reasoning_content": null
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              }
            ]
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null,
  "prompt_token_ids": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {
        "content": "! I'm"
      },
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "chatcmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "chat.completion.chunk",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "delta": {},
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  }
}

data: [DONE]
```
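The chunks above can be reassembled on the client side. The helper below is a minimal sketch: it extracts the `delta.content` pieces from `data:` lines and stops at the `[DONE]` sentinel, assuming the raw event bytes have already been decoded into text.

```python
import json

def collect_stream_text(raw_sse_text):
    """Concatenate assistant text from chat.completion.chunk SSE events."""
    pieces = []
    for line in raw_sse_text.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)
```

Note that real event streams may split a JSON chunk across payload parts, so production code should buffer bytes until each `data:` line is complete before parsing.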

**Text completion streaming response**

For streaming text completion requests:

```
data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "Paris",
      "logprobs": {
        "tokens": ["Paris"],
        "token_logprobs": [-0.31725305],
        "top_logprobs": [
          {
            "Paris": -0.31725305,
            "London": -1.3190403
          }
        ]
      },
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": ", the capital",
      "logprobs": null,
      "finish_reason": null,
      "stop_reason": null
    }
  ],
  "usage": null
}

data: {
  "id": "cmpl-123e4567-e89b-12d3-a456-426614174000",
  "object": "text_completion",
  "created": 1677652288,
  "model": "nova-micro-custom",
  "choices": [
    {
      "index": 0,
      "text": "",
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 11,
    "total_tokens": 16
  }
}

data: [DONE]
```

**Response fields explanation**
+ `id`: Unique identifier for the completion
+ `object`: Type of object returned ("chat.completion", "text_completion", "chat.completion.chunk")
+ `created`: Unix timestamp of when the completion was created
+ `model`: Model used for the completion
+ `choices`: Array of completion choices
+ `usage`: Token usage information including prompt, completion, and total tokens
+ `logprobs`: Log probability information for tokens (when requested)
+ `finish_reason`: Reason why the model stopped generating ("stop", "length", "content_filter")
+ `delta`: Incremental content in streaming responses
+ `reasoning`: Reasoning content when `reasoning_effort` is used
+ `token_ids`: Array of token IDs for the generated text

For complete API documentation, see [InvokeEndpoint API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) and [InvokeEndpointWithResponseStream API reference](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html).

# Evaluate Models Hosted on SageMaker Inference
Evaluate models

This guide explains how to evaluate your customized Amazon Nova models deployed on SageMaker inference endpoints using [Inspect AI](https://github.com/UKGovernmentBEIS/inspect_ai), an open-source evaluation framework.

**Note**  
For a hands-on walkthrough, see the [SageMaker Inspect AI quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb).

## Overview


You can evaluate your customized Amazon Nova models deployed on SageMaker endpoints using standardized benchmarks from the AI research community. This approach enables you to:
+ Evaluate customized Amazon Nova models (fine-tuned, distilled, or otherwise adapted) at scale
+ Run evaluations with parallel inference across multiple endpoint instances
+ Compare model performance using benchmarks like MMLU, TruthfulQA, and HumanEval
+ Integrate with your existing SageMaker infrastructure

## Supported models


The SageMaker inference provider works with:
+ Amazon Nova models (Nova Micro, Nova Lite, Nova Lite 2)
+ Models deployed via vLLM or OpenAI-compatible inference servers
+ Any endpoint that supports the OpenAI Chat Completions API format

## Prerequisites


Before you begin, ensure you have:
+ An AWS account with permissions to create and invoke SageMaker endpoints
+ AWS credentials configured via AWS CLI, environment variables, or IAM role
+ Python 3.9 or higher

**Required IAM permissions**

Your IAM user or role needs the following permissions:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:InvokeEndpoint",
        "sagemaker:DescribeEndpoint"
      ],
      "Resource": "arn:aws:sagemaker:*:*:endpoint/*"
    }
  ]
}
```

## Step 1: Deploy a SageMaker endpoint


Before running evaluations, you need a SageMaker inference endpoint running your model.

For instructions on creating a SageMaker inference endpoint with Amazon Nova models, see [Getting Started](nova-sagemaker-inference-getting-started.md).

Once your endpoint is in `InService` status, note the endpoint name for use in the evaluation commands.
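One way to wait for that status programmatically is boto3's built-in `endpoint_in_service` waiter. The sketch below uses a placeholder endpoint name; the small status check is separated out so it can be reused when polling manually.

```python
def is_endpoint_ready(status):
    """An endpoint accepts inference requests only once it reaches InService."""
    return status == "InService"

# Usage (requires boto3; endpoint name and region are placeholders):
# import boto3
# sagemaker = boto3.client("sagemaker", region_name="us-west-2")
# waiter = sagemaker.get_waiter("endpoint_in_service")
# waiter.wait(EndpointName="my-nova-endpoint")  # blocks until InService or failure
# status = sagemaker.describe_endpoint(EndpointName="my-nova-endpoint")["EndpointStatus"]
# assert is_endpoint_ready(status)
```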

## Step 2: Install evaluation dependencies


Create a Python virtual environment and install the required packages.

```
# Create virtual environment
python3.12 -m venv venv
source venv/bin/activate

# Install uv for faster package installation
pip install uv

# Install Inspect AI and evaluation benchmarks
uv pip install inspect-ai inspect-evals

# Install AWS dependencies
uv pip install aioboto3 boto3 botocore openai
```

## Step 3: Configure AWS credentials


Choose one of the following authentication methods:

**Option 1: AWS CLI (Recommended)**

```
aws configure
```

Enter your AWS Access Key ID, Secret Access Key, and default region when prompted.

**Option 2: Environment variables**

```
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
```

**Option 3: IAM role**

If running on Amazon EC2 or SageMaker notebooks, the instance's IAM role is used automatically.

**Verify credentials**

```
import boto3

sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Account: {identity['Account']}")
print(f"User/Role: {identity['Arn']}")
```

## Step 4: Install the SageMaker provider


The SageMaker provider enables Inspect AI to communicate with your SageMaker endpoints. The provider installation process is streamlined in the [quickstart notebook](https://github.com/aws-samples/amazon-nova-samples/tree/main/customization/sagemaker-inference/sagemaker_inspect_quickstart.ipynb).

## Step 5: Download evaluation benchmarks


Clone the Inspect Evals repository to access standard benchmarks:

```
git clone https://github.com/UKGovernmentBEIS/inspect_evals.git
```

This repository includes benchmarks such as:
+ MMLU and MMLU-Pro (knowledge and reasoning)
+ TruthfulQA (truthfulness)
+ HumanEval (code generation)
+ GSM8K (mathematical reasoning)

## Step 6: Run evaluations


Run an evaluation using your SageMaker endpoint:

```
cd inspect_evals/src/inspect_evals/

inspect eval mmlu_pro/mmlu_pro.py \
  --model sagemaker/my-nova-endpoint \
  -M region_name=us-west-2 \
  --max-connections 256 \
  --max-retries 100 \
  --display plain
```

**Key parameters**


| Parameter | Default | Description | 
| --- | --- | --- | 
| --max-connections | 10 | Number of parallel requests to the endpoint. Scale with instance count (e.g., 10 instances × 25 = 250). | 
| --max-retries | 3 | Retry attempts for failed requests. Use 50-100 for large evaluations. | 
| -M region_name | us-east-1 | AWS region where your endpoint is deployed. | 
| -M read_timeout | 600 | Request timeout in seconds. | 
| -M connect_timeout | 60 | Connection timeout in seconds. | 

**Tuning recommendations**

For a multi-instance endpoint:

```
# 10-instance endpoint example
--max-connections 250   # ~25 connections per instance
--max-retries 100       # Handle transient errors
```

Setting `--max-connections` too high may overwhelm the endpoint and cause throttling. Setting it too low underutilizes capacity.
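The scaling rule of thumb above can be expressed as a small helper. The 25-connections-per-instance figure comes from the example in the parameter table and may need tuning for your workload.

```python
def max_connections_for(instance_count, per_instance=25):
    """Rule of thumb: roughly 25 parallel connections per endpoint instance."""
    return instance_count * per_instance

# A 10-instance endpoint suggests --max-connections 250.
```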

## Step 7: View results


Launch the Inspect AI viewer to analyze evaluation results:

```
inspect view
```

The viewer displays:
+ Overall scores and metrics
+ Per-sample results with model responses
+ Error analysis and failure patterns

## Managing endpoints


**Update an endpoint**

To update an existing endpoint with a new model or configuration:

```
import boto3

sagemaker = boto3.client('sagemaker', region_name=REGION)

# Create new model and endpoint configuration
# Then update the endpoint
sagemaker.update_endpoint(
    EndpointName=EXISTING_ENDPOINT_NAME,
    EndpointConfigName=NEW_ENDPOINT_CONFIG_NAME
)
```

**Delete an endpoint**

```
sagemaker.delete_endpoint(EndpointName=ENDPOINT_NAME)
```

## Onboarding custom benchmarks


You can add new benchmarks to Inspect AI using the following workflow:

1. Study the benchmark's dataset format and evaluation metrics

1. Review similar implementations in `inspect_evals/`

1. Create a task file that converts dataset records to Inspect AI samples

1. Implement appropriate solvers and scorers

1. Validate with a small test run

Example task structure:

```
from inspect_ai import Task, task
from inspect_ai.dataset import hf_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def my_benchmark():
    return Task(
        dataset=hf_dataset("dataset_name", split="test"),
        solver=multiple_choice(),
        scorer=choice()
    )
```

## Troubleshooting


**Common issues**

**Endpoint throttling or timeouts**
+ Reduce `--max-connections`
+ Increase `--max-retries`
+ Check endpoint CloudWatch metrics for capacity issues

**Authentication errors**
+ Verify AWS credentials are configured correctly
+ Check IAM permissions include `sagemaker:InvokeEndpoint`

**Model errors**
+ Verify the endpoint is in `InService` status
+ Check that the model supports the OpenAI Chat Completions API format

## Related resources

+ [Inspect AI Documentation](https://inspect.ai-safety-institute.org.uk/)
+ [Inspect Evals Repository](https://github.com/UKGovernmentBEIS/inspect_evals)
+ [SageMaker Developer Guide](https://docs.aws.amazon.com//sagemaker/latest/dg/whatis.html)
+ [Deploy Models for Inference](https://docs.aws.amazon.com//sagemaker/latest/dg/deploy-model.html)
+ [Configuring the AWS CLI](https://docs.aws.amazon.com//cli/latest/userguide/cli-chap-configure.html)

# Abuse Detection for Amazon Nova Forge Model Deployments in Amazon SageMaker Inference
Abuse detection for Amazon Nova Forge

AWS is committed to the responsible use of AI. To help prevent potential misuse, when you deploy Amazon Nova Forge Models in Amazon SageMaker Inference, SageMaker Inference implements automated abuse detection mechanisms to identify potential violations of AWS's [Acceptable Use Policy](https://aws.amazon.com/aup/) (AUP) and Service Terms, including the [Responsible AI Policy](https://aws.amazon.com/ai/responsible-ai/policy/).

Our abuse detection mechanisms are fully automated, so there is no human review of, or access to, user inputs or model outputs.

Automated abuse detection includes:
+ **Categorize content** – We use classifiers to detect harmful content (such as content that incites violence) in user inputs and model outputs. A classifier is an algorithm that processes model inputs and outputs, and assigns a type of harm and a level of confidence. We may run these classifiers on Amazon Nova Forge Model usage. The classification process is automated and does not involve human review of user inputs or model outputs.
+ **Identify patterns** – We use classifier metrics to identify potential violations and recurring behavior. We may compile and share anonymized classifier metrics. Amazon SageMaker Inference does not store user input or model output.
+ **Detecting and blocking child sexual abuse material (CSAM)** – You are responsible for the content you (and your end users) upload to Amazon SageMaker Inference and must ensure this content is free from illegal images. To help stop the dissemination of CSAM, when deploying an Amazon Nova Forge Model in Amazon SageMaker Inference, SageMaker Inference may use automated abuse detection mechanisms (such as hash matching technology or classifiers) to detect apparent CSAM. If Amazon SageMaker Inference detects apparent CSAM in your image inputs, Amazon SageMaker Inference will block the request and you will receive an automated error message. Amazon SageMaker Inference may also file a report with the National Center for Missing and Exploited Children (NCMEC) or a relevant authority. We take CSAM seriously and will continue to update our detection, blocking, and reporting mechanisms. You might be required by applicable laws to take additional actions, and you are responsible for those actions.

Once our automated abuse detection mechanisms identify potential violations, we may request information about your use of Amazon SageMaker Inference and compliance with our terms of service. In the event that you are non-responsive, unwilling, or unable to comply with these terms or policies, AWS may suspend your access to Amazon SageMaker Inference. You may also be billed for the failed inference job if our automated tests detect model responses that are inconsistent with our terms and policies.

Contact AWS Support if you have additional questions. For more information, see the [Amazon SageMaker FAQs](https://aws.amazon.com/sagemaker/ai/faqs/).