

# Optimized generative AI inference recommendations
<a name="generative-ai-inference-recommendations"></a>

Amazon SageMaker AI supports inference recommendations, a capability that eliminates the manual optimization and benchmarking otherwise required to achieve optimal inference performance. Instead of manually testing combinations of GPU instance types, serving containers, parallelism strategies, and optimization techniques, you provide your model and workload requirements, and SageMaker AI returns validated, deployment-ready configurations with real performance metrics.

Inference recommendations analyzes your model's architecture, narrows the configuration space, and applies goal-aligned optimizations such as speculative decoding for throughput and kernel tuning for latency. Because it evaluates multiple instance types, you can select the most price-performant option for your workload. It benchmarks each configuration on real GPU infrastructure, so you can deploy with confidence and right-size your inference spend.

## How it works
<a name="generative-ai-inference-recommendations-how-it-works"></a>

Getting started with inference recommendations is straightforward, whether through SageMaker AI Studio or the SageMaker AI APIs. The following steps describe the workflow.

1. **Prepare your model.** Point to model artifacts in Amazon S3 or the SageMaker AI Model Registry. Inference recommendations supports the Hugging Face checkpoint format with safetensors weights, including base models and custom or fine-tuned models.

1. **Define your workload.** Describe your expected traffic patterns, including input and output token distributions and concurrency levels. You can use inline specifications or a representative dataset from Amazon S3.

1. **Set your goal.** Choose a single performance objective: optimize for cost, minimize latency, or maximize throughput. Select up to three instance types to compare.

1. **Review results.** SageMaker AI returns validated configurations with real performance metrics: Time to First Token (TTFT), inter-token latency, request latency at P50/P90/P99, throughput, and cost per configuration. Each configuration is deployment-ready.

1. **Deploy.** Deploy the chosen configuration to a SageMaker AI inference endpoint with a single action from SageMaker AI Studio, or programmatically through the API.

You can also benchmark existing production endpoints to validate current performance or compare against new configurations.

## Use cases
<a name="generative-ai-inference-recommendations-use-cases"></a>

The following are common use cases for inference recommendations.
+ **Pre-deployment validation.** Optimize and benchmark a new model before committing to a production deployment. Validate how the model performs before you invest in scaling it.
+ **Regression testing after updates.** Validate performance after a container update, framework upgrade, or serving library release. Confirm that your configuration is still optimal before pushing to production.
+ **Right-sizing when conditions change.** When traffic patterns shift or new instance types become available, re-run inference recommendations in hours rather than restarting a weeks-long manual process.
+ **Model comparison.** Compare the performance and cost of different model variants across instance types to make an informed selection before production deployment.
+ **Cost optimization.** Benchmark existing production endpoints to identify over-provisioned infrastructure. Use the results to right-size and reduce recurring inference spend.

## Pricing
<a name="generative-ai-inference-recommendations-pricing"></a>

Inference recommendations has no additional service fee. You can use existing ML Reservations (Flexible Training Plans) at no additional compute cost, or use on-demand compute that is provisioned automatically.

## Supported Regions
<a name="generative-ai-inference-recommendations-regions"></a>

Inference recommendations is available in the following AWS Regions:
+ US East (N. Virginia)
+ US East (Ohio)
+ US West (Oregon)
+ Asia Pacific (Singapore)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Europe (Ireland)

# Set up a workload configuration for generative AI inference recommendations
<a name="generative-ai-inference-recommendations-workload-config"></a>

A workload configuration defines the traffic patterns and benchmark parameters that SageMaker AI uses when evaluating your model or endpoint. You create a workload configuration before running a recommendation job or a benchmark job. The same workload configuration can be reused across multiple jobs.

You can define your workload in two ways:
+ **Inline specification.** Specify token distributions and traffic parameters directly in the API call.
+ **Dataset from Amazon S3.** Provide a representative dataset of real requests using the `DatasetConfig` parameter.

## Create a workload configuration with inline parameters
<a name="generative-ai-inference-recommendations-workload-config-inline"></a>

Use inline parameters to specify token distributions when you don't have a representative dataset.

**Python (boto3)**

```
import boto3
import json

client = boto3.client("sagemaker", region_name="us-west-2")

workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "prompt_input_tokens_mean": 550,
        "prompt_input_tokens_stddev": 150,
        "output_tokens_mean": 150,
        "output_tokens_stddev": 50,
    },
}

response = client.create_ai_workload_config(
    AIWorkloadConfigName="my-workload-config",
    AIWorkloadConfigs={
        "WorkloadSpec": {"Inline": json.dumps(workload_spec)}
    },
)
print(response["AIWorkloadConfigArn"])
```

**AWS CLI**

```
aws sagemaker create-ai-workload-config \
  --ai-workload-config-name "my-workload-config" \
  --ai-workload-configs '{"WorkloadSpec": {"Inline": "{\"benchmark\": {\"type\": \"aiperf\"}, \"parameters\": {\"prompt_input_tokens_mean\": 550, \"output_tokens_mean\": 150}}"}}' \
  --region us-west-2
```

## Create a workload configuration with a dataset
<a name="generative-ai-inference-recommendations-workload-config-dataset"></a>

If you have a representative dataset of real requests, provide it through Amazon S3 using the `DatasetConfig` parameter with an `InputDataConfig` channel.

```
response = client.create_ai_workload_config(
    AIWorkloadConfigName="my-dataset-workload",
    DatasetConfig={
        "InputDataConfig": [
            {
                "ChannelName": "traffic",
                "DataSource": {
                    "S3DataSource": {
                        "S3Uri": "s3://DOC-EXAMPLE-BUCKET/datasets/traffic-trace/"
                    }
                }
            }
        ]
    },
    AIWorkloadConfigs={
        "WorkloadSpec": {"Inline": json.dumps(workload_spec)}
    },
)
```

If you don't provide a dataset, SageMaker AI generates synthetic prompts based on your inline token distributions. You can also use a public dataset or provide a custom dataset from Amazon S3.

## Workload configuration for benchmarking
<a name="generative-ai-inference-recommendations-workload-config-benchmark"></a>

When you create a workload configuration for benchmarking an existing endpoint, you can specify additional parameters such as the tokenizer, concurrency, request count, and request rate.

```
workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "tokenizer": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "concurrency": 1,
        "request_count": 10,
        "streaming": True,
        "prompt_input_tokens_mean": 550,
        "prompt_input_tokens_stddev": 150,
        "output_tokens_mean": 50,
        "output_tokens_stddev": 10,
        "request_rate": 1.0,
        "benchmark_duration": 60,
    },
    "tooling": {"api_standard": "openai", "version": "0.6.0"},
}
```
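You can register this spec the same way as the inline example earlier, by serializing it into the `Inline` field. The following sketch builds the request arguments; the API call itself is shown commented out so the snippet stays self-contained:

```
import json

# Benchmark workload spec from above, trimmed for brevity.
workload_spec = {
    "benchmark": {"type": "aiperf"},
    "parameters": {
        "tokenizer": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "concurrency": 1,
        "request_count": 10,
        "streaming": True,
        "prompt_input_tokens_mean": 550,
        "output_tokens_mean": 50,
    },
    "tooling": {"api_standard": "openai", "version": "0.6.0"},
}

# Arguments for create_ai_workload_config; the spec is JSON-serialized
# into the Inline field, as in the earlier inline example.
config_kwargs = {
    "AIWorkloadConfigName": "my-benchmark-config",
    "AIWorkloadConfigs": {"WorkloadSpec": {"Inline": json.dumps(workload_spec)}},
}
# client.create_ai_workload_config(**config_kwargs)
```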

## Manage workload configurations
<a name="generative-ai-inference-recommendations-workload-config-manage"></a>

Use the following operations to manage your workload configurations.

```
# List workload configurations
response = client.list_ai_workload_configs(MaxResults=10)
for config in response["AIWorkloadConfigs"]:
    print(f"{config['AIWorkloadConfigName']} - {config['AIWorkloadConfigArn']}")

# Describe a workload configuration
response = client.describe_ai_workload_config(
    AIWorkloadConfigName="my-workload-config"
)

# Delete a workload configuration
client.delete_ai_workload_config(
    AIWorkloadConfigName="my-workload-config"
)
```

# Get generative AI inference deployment recommendations
<a name="generative-ai-inference-recommendations-get-started"></a>

AI recommendation jobs analyze your model and workload characteristics to generate deployment configurations optimized for cost, latency, or throughput. The service evaluates instance types, applies optimizations like speculative decoding, and benchmarks each configuration on real GPU infrastructure.

## Prerequisites
<a name="generative-ai-inference-recommendations-get-started-prereqs"></a>

Before you create a recommendation job, you need the following:
+ Model artifacts stored in Amazon S3 in the Hugging Face checkpoint format with safetensors weights
+ An Amazon S3 bucket for recommendation output
+ An AWS Identity and Access Management (IAM) execution role that grants SageMaker AI access to your model artifacts and output bucket

## Step 1: Create a recommendation job
<a name="generative-ai-inference-recommendations-get-started-create"></a>

A recommendation job analyzes your model and generates deployment recommendations. You specify the model location, output location, workload configuration, and a performance target.

**Python (boto3)**

```
response = client.create_ai_recommendation_job(
    AIRecommendationJobName="my-recommendation-job",
    ModelSource={
        "S3": {
            "S3Uri": "s3://DOC-EXAMPLE-BUCKET/models/my-model/",
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/recommendations/"
    },
    PerformanceTarget={
        "Constraints": [
            {"Metric": "ttft-ms"}
        ]
    },
    AIWorkloadConfigIdentifier="my-recommendation-workload",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)
print(response["AIRecommendationJobArn"])
```

**AWS CLI**

```
aws sagemaker create-ai-recommendation-job \
  --ai-recommendation-job-name "my-recommendation-job" \
  --model-source '{"S3": {"S3Uri": "s3://DOC-EXAMPLE-BUCKET/models/my-model/"}}' \
  --output-config '{"S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/recommendations/"}' \
  --performance-target '{"Constraints": [{"Metric": "ttft-ms"}]}' \
  --ai-workload-config-identifier "my-recommendation-workload" \
  --role-arn "arn:aws:iam::111122223333:role/ExampleRole" \
  --region us-west-2
```

You can also specify the following optional parameters:

`ComputeSpec`  
Restrict the instance types to evaluate (maximum three). For example: `{"InstanceTypes": ["ml.g5.12xlarge", "ml.p4d.24xlarge"]}`

`OptimizeModel`  
Set to `true` to allow model optimizations such as speculative decoding.

`InferenceSpecification`  
Specify the inference framework. Valid values: `LMI`, `VLLM`.
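Combined, the optional parameters might look like the following when added to the `create_ai_recommendation_job` call. The grouping below is assembled from the parameter descriptions above; the call itself is left commented so the snippet stays self-contained:

```
# Optional arguments for create_ai_recommendation_job, combined with the
# required arguments from the earlier example.
optional_kwargs = {
    # Restrict evaluation to at most three instance types.
    "ComputeSpec": {"InstanceTypes": ["ml.g5.12xlarge", "ml.p4d.24xlarge"]},
    # Allow model optimizations such as speculative decoding.
    "OptimizeModel": True,
    # Pin the inference framework. Valid values: LMI, VLLM.
    "InferenceSpecification": "VLLM",
}
# client.create_ai_recommendation_job(
#     AIRecommendationJobName="my-recommendation-job",
#     ...,  # required arguments from the earlier example
#     **optional_kwargs,
# )
```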

## Step 2: Monitor job status
<a name="generative-ai-inference-recommendations-get-started-monitor"></a>

Poll the job status until it reaches a terminal state.

**Python (boto3)**

```
import time

while True:
    response = client.describe_ai_recommendation_job(
        AIRecommendationJobName="my-recommendation-job"
    )
    status = response["AIRecommendationJobStatus"]
    print(f"Status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)
```

**AWS CLI**

```
aws sagemaker describe-ai-recommendation-job \
  --ai-recommendation-job-name "my-recommendation-job" \
  --region us-west-2
```

## Step 3: Review recommendations
<a name="generative-ai-inference-recommendations-get-started-results"></a>

When the job completes, the describe response includes a `Recommendations` array. Each recommendation contains a deployment-ready configuration with the following information:

`DeploymentConfiguration`  
Container image URI, instance type, instance count, and environment variables. You can use this configuration to deploy directly to a SageMaker AI endpoint.

`ExpectedPerformance`  
Validated performance metrics including Time to First Token (TTFT), request latency at P90 and P99, throughput in tokens per second, and request throughput.

`OptimizationDetails`  
Applied optimization techniques such as speculative decoding or kernel tuning, with their configuration parameters.

The following performance target metrics are available:

`ttft-ms`  
Time to first token in milliseconds.

`throughput`  
Tokens per second.

`cost`  
Cost per hour of the deployment configuration.

Each metric in the `ExpectedPerformance` response includes a `Stat` field indicating the statistical measure, a `Value`, and an optional `Unit`. Common statistics include `average`, `p50`, `p90`, `p95`, `p99`, `max`, and `min`.
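You can pull these fields out of the describe response as follows. The sample response fragment here is hypothetical and only illustrates the documented field names; verify the exact response shape against your own job output:

```
# Hypothetical describe_ai_recommendation_job response fragment, using the
# field names documented above.
response = {
    "Recommendations": [
        {
            "DeploymentConfiguration": {
                "InstanceType": "ml.g5.12xlarge",
                "InstanceCount": 1,
            },
            "ExpectedPerformance": [
                {"Metric": "ttft-ms", "Stat": "p90", "Value": 215.0, "Unit": "ms"},
                {"Metric": "throughput", "Stat": "average", "Value": 1840.0},
            ],
        }
    ]
}

# Summarize each recommendation: instance sizing plus its metrics.
for rec in response["Recommendations"]:
    cfg = rec["DeploymentConfiguration"]
    print(f"{cfg['InstanceType']} x{cfg['InstanceCount']}")
    for metric in rec["ExpectedPerformance"]:
        unit = metric.get("Unit", "")
        print(f"  {metric['Metric']} ({metric['Stat']}): {metric['Value']} {unit}")
```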

## Manage recommendation resources
<a name="generative-ai-inference-recommendations-get-started-manage"></a>

Use the following operations to manage your recommendation jobs and workload configurations.

```
# List recommendation jobs
response = client.list_ai_recommendation_jobs(MaxResults=10)
for job in response["AIRecommendationJobs"]:
    print(f"{job['AIRecommendationJobName']} - {job['AIRecommendationJobStatus']}")

# Stop a running job
client.stop_ai_recommendation_job(
    AIRecommendationJobName="my-recommendation-job"
)

# Delete a job (must be in a terminal state)
client.delete_ai_recommendation_job(
    AIRecommendationJobName="my-recommendation-job"
)

# List workload configurations
response = client.list_ai_workload_configs(MaxResults=10)
for config in response["AIWorkloadConfigs"]:
    print(f"{config['AIWorkloadConfigName']} - {config['AIWorkloadConfigArn']}")

# Delete a workload configuration
client.delete_ai_workload_config(
    AIWorkloadConfigName="my-recommendation-workload"
)
```

# Benchmark generative AI inference endpoints
<a name="generative-ai-inference-recommendations-benchmark"></a>

The SageMaker AI benchmarking service measures the performance of large language models (LLMs) hosted on SageMaker AI endpoints. It runs benchmarks using NVIDIA AIPerf, producing metrics such as request latency, throughput, time to first token, and inter-token latency.

## Prerequisites
<a name="generative-ai-inference-recommendations-benchmark-prereqs"></a>

Before you create a benchmark job, you need the following:
+ A SageMaker AI endpoint in `InService` status hosting an LLM that supports the OpenAI-compatible chat completions API
+ An Amazon S3 bucket for benchmark output
+ An IAM execution role that grants SageMaker AI access to your endpoint and output bucket

## Step 1: Create a benchmark job
<a name="generative-ai-inference-recommendations-benchmark-create"></a>

A benchmark job targets a specific SageMaker AI endpoint and references a workload configuration.

**Python (boto3)**

```
response = client.create_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job",
    BenchmarkTarget={
        "Endpoint": {
            "Identifier": "my-sagemaker-endpoint"
        }
    },
    OutputConfig={
        "S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"
    },
    AIWorkloadConfigIdentifier="my-benchmark-config",
    RoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)
print(response["AIBenchmarkJobArn"])
```

**AWS CLI**

```
aws sagemaker create-ai-benchmark-job \
  --ai-benchmark-job-name "my-benchmark-job" \
  --benchmark-target '{"Endpoint": {"Identifier": "my-sagemaker-endpoint"}}' \
  --output-config '{"S3OutputLocation": "s3://DOC-EXAMPLE-BUCKET/benchmark-results/"}' \
  --ai-workload-config-identifier "my-benchmark-config" \
  --role-arn "arn:aws:iam::111122223333:role/ExampleRole" \
  --region us-west-2
```

If your endpoint hosts multiple models through inference components, you can specify them in the `InferenceComponents` parameter of the `BenchmarkTarget`.

If your endpoint is in a VPC, pass the `NetworkConfig` parameter with your `VpcConfig` settings, including security group IDs and subnets.
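For example, a `BenchmarkTarget` with inference components and a `NetworkConfig` for a VPC endpoint might look like the following. The nested field names for `InferenceComponents` and `VpcConfig` follow common SageMaker AI shapes and are assumptions here; the identifiers are placeholders:

```
# Assumed request fragments for create_ai_benchmark_job: a target with
# inference components, plus a VPC network configuration.
benchmark_target = {
    "Endpoint": {
        "Identifier": "my-sagemaker-endpoint",
        # Benchmark specific models hosted as inference components.
        "InferenceComponents": ["my-inference-component"],
    }
}
network_config = {
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    }
}
# client.create_ai_benchmark_job(
#     AIBenchmarkJobName="my-vpc-benchmark-job",
#     BenchmarkTarget=benchmark_target,
#     NetworkConfig=network_config,
#     ...,  # remaining required arguments from the earlier example
# )
```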

## Step 2: Monitor job status
<a name="generative-ai-inference-recommendations-benchmark-monitor"></a>

Poll the job status until it reaches a terminal state.

**Python (boto3)**

```
import time

while True:
    response = client.describe_ai_benchmark_job(
        AIBenchmarkJobName="my-benchmark-job"
    )
    status = response["AIBenchmarkJobStatus"]
    print(f"Status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)

if status == "Completed":
    print(f"Results at: {response['OutputConfig']['S3OutputLocation']}")
elif status == "Failed":
    print(f"Job failed: {response.get('FailureReason', 'unknown')}")
```

**AWS CLI**

```
aws sagemaker describe-ai-benchmark-job \
  --ai-benchmark-job-name "my-benchmark-job" \
  --region us-west-2
```

## Step 3: Review benchmark results
<a name="generative-ai-inference-recommendations-benchmark-results"></a>

Benchmark results are written to the Amazon S3 output location that you specified. The results include the following key metrics:

`request_throughput`  
Requests per second.

`request_latency`  
End-to-end request latency with percentile breakdowns (P50, P90, P99).

`time_to_first_token`  
Time from request submission to the first token received.

`inter_token_latency`  
Time between consecutive output tokens.

`output_token_throughput`  
Output tokens generated per second.

Each metric includes statistical summaries: average, minimum, maximum, P50, P90, P99, and standard deviation.
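After downloading a results file from the output location, you can summarize it with a short script. The JSON layout below is a hypothetical sketch built from the metric names and statistics listed above, not a documented schema; adapt the parsing to the actual files your job writes:

```
import json

# Hypothetical results document using the metric names and statistics
# listed above (latencies in milliseconds).
results_json = """
{
  "time_to_first_token": {"avg": 180.2, "p50": 170.0, "p90": 240.5, "p99": 310.0},
  "inter_token_latency": {"avg": 22.4, "p50": 21.0, "p90": 28.3, "p99": 35.1},
  "request_throughput": {"avg": 4.7},
  "output_token_throughput": {"avg": 1650.0}
}
"""

# Print one summary line per metric.
results = json.loads(results_json)
for metric, stats in results.items():
    summary = ", ".join(f"{stat}={value}" for stat, value in stats.items())
    print(f"{metric}: {summary}")
```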

## Manage benchmark resources
<a name="generative-ai-inference-recommendations-benchmark-manage"></a>

Use the following operations to manage your benchmark jobs and workload configurations.

```
# List benchmark jobs
response = client.list_ai_benchmark_jobs(MaxResults=10)
for job in response["AIBenchmarkJobs"]:
    print(f"{job['AIBenchmarkJobName']} - {job['AIBenchmarkJobStatus']}")

# Stop a running job
client.stop_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job"
)

# Delete a job (must be in a terminal state)
client.delete_ai_benchmark_job(
    AIBenchmarkJobName="my-benchmark-job"
)

# List workload configurations
response = client.list_ai_workload_configs(MaxResults=10)
for config in response["AIWorkloadConfigs"]:
    print(f"{config['AIWorkloadConfigName']} - {config['AIWorkloadConfigArn']}")

# Delete a workload configuration
client.delete_ai_workload_config(
    AIWorkloadConfigName="my-benchmark-config"
)
```

# Deploy a generative AI inference recommendation
<a name="generative-ai-inference-recommendations-deploy"></a>

When a recommendation job completes, each recommendation includes a deployment-ready configuration. You can deploy the chosen configuration to a SageMaker AI inference endpoint with a single action from SageMaker AI Studio, or programmatically through the API.

## Understanding deployment configurations
<a name="generative-ai-inference-recommendations-deploy-overview"></a>

Each recommendation in the job response contains a `DeploymentConfiguration` object with the following information:

`ImageUri`  
The container image URI optimized for the recommended instance type.

`InstanceType`  
The recommended instance type for deployment.

`InstanceCount`  
The number of instances needed to meet the performance target.

`CopyCountPerInstance`  
The number of model copies to run per instance. When set to a value greater than one, multiple copies of the model are loaded on each instance to increase throughput.

`EnvironmentVariables`  
Environment variables configured for optimal performance, such as tensor parallel size and maximum sequence length.

`S3`  
S3 channel references for model artifacts, including any optimized model outputs.

## Deploy using the API
<a name="generative-ai-inference-recommendations-deploy-api"></a>

To deploy a recommendation programmatically, use the model package from the recommendation to create a SageMaker AI model and endpoint. Each recommendation includes a `ModelDetails` object with the model package ARN and inference specification name. This is the simplest deployment path because the model package already contains the container image, environment variables, and model artifact channels.

```
import boto3

client = boto3.client("sagemaker", region_name="us-west-2")

# Get the recommendation from a completed job
response = client.describe_ai_recommendation_job(
    AIRecommendationJobName="my-recommendation-job"
)

# Select a recommendation (e.g., the first one)
recommendation = response["Recommendations"][0]
model_details = recommendation["ModelDetails"]
deploy_config = recommendation["DeploymentConfiguration"]

# Create a model from the model package.
# The model package already contains the container image, environment
# variables, and S3 data channels (base model + optimization artifacts).
model_name = "my-recommended-model"
container_def = {
    "ModelPackageName": model_details["ModelPackageArn"],
}
# If the recommendation uses a named inference specification (e.g., for
# a specific optimization variant), specify it so SageMaker selects the
# correct container and instance configuration from the model package.
if model_details.get("InferenceSpecificationName"):
    container_def["InferenceSpecificationName"] = model_details["InferenceSpecificationName"]

client.create_model(
    ModelName=model_name,
    PrimaryContainer=container_def,
    ExecutionRoleArn="arn:aws:iam::111122223333:role/ExampleRole",
)

# Create an endpoint configuration
endpoint_config_name = "my-recommended-endpoint-config"
production_variant = {
    "VariantName": "AllTraffic",
    "ModelName": model_name,
    "InstanceType": deploy_config["InstanceType"],
    "InitialInstanceCount": deploy_config.get("InstanceCount", 1),
}
copy_count = deploy_config.get("CopyCountPerInstance")
if copy_count and copy_count > 1:
    production_variant["InferenceAmiVersion"] = "al2-ami-sagemaker-inference-gpu-2"
    production_variant["RoutingConfig"] = {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"}

client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[production_variant],
)

# Create the endpoint
endpoint_name = "my-recommended-endpoint"
client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
print(f"Endpoint {endpoint_name} is being created.")
```

After the endpoint is created, you can monitor its status using the `DescribeEndpoint` API until it reaches `InService` status.

```
import time

while True:
    response = client.describe_endpoint(EndpointName=endpoint_name)
    status = response["EndpointStatus"]
    print(f"Endpoint status: {status}")
    if status in ("InService", "Failed"):
        break
    time.sleep(60)
```
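Once the endpoint is `InService`, you can send it a request. Because the benchmarking prerequisites note that supported containers serve an OpenAI-compatible chat completions API, the payload below follows that schema; treat the exact request body and response parsing as assumptions to verify against your container:

```
import json

# OpenAI-style chat completions payload; the served container is assumed
# to accept this schema.
payload = {
    "messages": [{"role": "user", "content": "Summarize TTFT in one sentence."}],
    "max_tokens": 128,
}

# runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")
# response = runtime.invoke_endpoint(
#     EndpointName="my-recommended-endpoint",
#     ContentType="application/json",
#     Body=json.dumps(payload),
# )
# print(json.loads(response["Body"].read()))
```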

## Deploy from SageMaker AI Studio
<a name="generative-ai-inference-recommendations-deploy-studio"></a>

You can also deploy a recommended configuration directly from SageMaker AI Studio with a single action. In SageMaker AI Studio, navigate to the completed recommendation job, review the recommendations and their performance metrics, and choose the configuration you want to deploy.

# Security in Amazon SageMaker AI inference optimization
<a name="inference-recommendations-security"></a>

Cloud security at AWS is the highest priority. As an AWS customer, you benefit from a data center and network architecture that is built to meet the requirements of the most security-sensitive organizations.

Security is a shared responsibility between AWS and you. The [shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/) describes this as security *of* the cloud and security *in* the cloud:
+ **Security of the cloud** – AWS is responsible for protecting the infrastructure that runs AWS services in the AWS Cloud. AWS also provides you with services that you can use securely. Third-party auditors regularly test and verify the effectiveness of our security as part of the [AWS compliance programs](https://aws.amazon.com/compliance/programs/). To learn about the compliance programs that apply to Amazon SageMaker AI, see [AWS Services in Scope by Compliance Program](https://aws.amazon.com/compliance/services-in-scope/).
+ **Security in the cloud** – Your responsibility is determined by the AWS service that you use. You are also responsible for other factors including the sensitivity of your data, your company's requirements, and applicable laws and regulations.

This documentation helps you understand how to apply the shared responsibility model when using SageMaker AI inference optimization features, including AI benchmarking jobs, AI recommendation jobs, and AI workload configurations.

## Data protection
<a name="inference-optimization-data-protection"></a>

The AWS [shared responsibility model](https://aws.amazon.com/compliance/shared-responsibility-model/) applies to data protection in Amazon SageMaker AI inference optimization. As described in this model, AWS is responsible for protecting the global infrastructure that runs all of the AWS Cloud. You are responsible for maintaining control over your content that is hosted on this infrastructure.

For data protection purposes, we recommend that you protect AWS account credentials and set up individual users with AWS IAM Identity Center or AWS Identity and Access Management (IAM). That way, each user is given only the permissions necessary to fulfill their job duties. We also recommend that you secure your data in the following ways:
+ Use multi-factor authentication (MFA) with each account.
+ Use SSL/TLS to communicate with AWS resources. We require TLS 1.2 and recommend TLS 1.3.
+ Set up API and user activity logging with AWS CloudTrail.
+ Use AWS encryption solutions, along with all default security controls within AWS services.
+ Use advanced managed security services such as Amazon Macie, which assists in discovering and securing sensitive data that is stored in Amazon S3.

We strongly recommend that you never put confidential or sensitive information, such as your customers' email addresses, into tags or free-form text fields such as a **Name** field.

### What data SageMaker AI inference optimization stores
<a name="inference-optimization-data-stored"></a>

SageMaker AI inference optimization stores the following types of data:
+ **Job metadata** – When you create AI benchmark jobs or AI recommendation jobs, the service stores job configuration metadata such as job names, status, creation timestamps, and resource configuration parameters.
+ **Workload configurations** – When you create AI workload configurations, the service stores the configuration parameters you provide, including benchmark parameters, dataset configuration, and tags.
+ **Benchmark results and recommendations** – Job outputs such as performance metrics, cost estimates, and deployment recommendations are stored as job metadata within the service.

SageMaker AI inference optimization does not store your model weights, training data, or inference results. Your model artifacts and benchmark output files remain in your Amazon S3 buckets within your AWS account.

### Encryption at rest
<a name="inference-optimization-encryption-rest"></a>

SageMaker AI inference optimization encrypts all stored data at rest by default. Job metadata and workload configurations are stored in Amazon DynamoDB with encryption at rest. You do not need to take any action to enable encryption at rest.

### Encryption in transit
<a name="inference-optimization-encryption-transit"></a>

SageMaker AI inference optimization uses TLS to encrypt all data in transit. API requests to the service are made over HTTPS using TLS 1.2 or later.

All communication between SageMaker AI inference optimization and other AWS services (such as Amazon DynamoDB, AWS Lambda, Amazon S3, and AWS Secrets Manager) uses TLS-encrypted connections.

### Internetwork traffic privacy
<a name="inference-optimization-traffic-privacy"></a>

SageMaker AI inference optimization API endpoints are accessible over the public internet using HTTPS. You can use VPC endpoints for SageMaker AI API to keep traffic between your VPC and the SageMaker AI API within the AWS network, without traversing the public internet.

When you provide a VPC configuration for your AI benchmark jobs, the service creates resources within your specified VPC subnets and security groups.

## Identity and Access Management
<a name="inference-optimization-iam"></a>

Amazon SageMaker AI inference optimization uses AWS Identity and Access Management (IAM) to control access to its resources and operations.

### How SageMaker AI inference optimization works with IAM
<a name="inference-optimization-iam-how"></a>

SageMaker AI inference optimization is accessed through the SageMaker AI API. All API calls are authenticated and authorized using IAM.

The inference optimization APIs use the following IAM actions in the `sagemaker` namespace:
+ `sagemaker:CreateAIWorkloadConfig`
+ `sagemaker:DescribeAIWorkloadConfig`
+ `sagemaker:ListAIWorkloadConfigs`
+ `sagemaker:DeleteAIWorkloadConfig`
+ `sagemaker:CreateAIBenchmarkJob`
+ `sagemaker:DescribeAIBenchmarkJob`
+ `sagemaker:ListAIBenchmarkJobs`
+ `sagemaker:StopAIBenchmarkJob`
+ `sagemaker:DeleteAIBenchmarkJob`
+ `sagemaker:CreateAIRecommendationJob`
+ `sagemaker:DescribeAIRecommendationJob`
+ `sagemaker:ListAIRecommendationJobs`
+ `sagemaker:StopAIRecommendationJob`
+ `sagemaker:DeleteAIRecommendationJob`
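For example, an identity policy granting only the read-only subset of these actions might look like the following, built here as a Python dict for brevity:

```
import json

# Read-only policy over the inference optimization APIs listed above.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeAIWorkloadConfig",
                "sagemaker:ListAIWorkloadConfigs",
                "sagemaker:DescribeAIBenchmarkJob",
                "sagemaker:ListAIBenchmarkJobs",
                "sagemaker:DescribeAIRecommendationJob",
                "sagemaker:ListAIRecommendationJobs",
            ],
            "Resource": "*",
        }
    ],
}
print(json.dumps(read_only_policy, indent=2))
```

Attach a policy like this to users or roles that need to inspect jobs but not create, stop, or delete them.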

### Execution roles
<a name="inference-optimization-execution-roles"></a>

When you create an AI benchmark job or AI recommendation job, you provide an IAM execution role (`RoleArn`). The service assumes this role to perform operations in your AWS account, such as:
+ Creating and managing SageMaker AI training jobs, endpoints, and optimization jobs
+ Reading model artifacts from Amazon S3
+ Writing benchmark results to Amazon S3
+ Accessing secrets from AWS Secrets Manager

The execution role must have a trust policy that allows the SageMaker AI service to assume it. For more information about creating SageMaker AI execution roles, see [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

### Resource isolation
<a name="inference-optimization-resource-isolation"></a>

SageMaker AI inference optimization enforces account-level isolation. Each job and workload configuration is scoped to the AWS account that created it. You cannot access or modify resources belonging to another AWS account.

All SageMaker AI resources created by the service (training jobs, endpoints, optimization jobs) are created in your AWS account using your execution role, and are subject to your account's IAM policies and service quotas.

## Security best practices
<a name="inference-optimization-best-practices"></a>

The following best practices are general guidelines and don't represent a complete security solution. Because these best practices might not be appropriate or sufficient for your environment, treat them as helpful considerations rather than prescriptions.

### Preventative best practices
<a name="inference-optimization-preventative"></a>
+ **Use least privilege for IAM policies.** Grant only the minimum permissions required for users and execution roles. Avoid using wildcard (`*`) actions or resources in IAM policies.
+ **Use separate execution roles for different workloads.** Create dedicated IAM execution roles for benchmark jobs and recommendation jobs rather than sharing a single role across all workloads.
+ **Use AWS Secrets Manager for sensitive values.** When your workload specification requires sensitive values such as Hugging Face access tokens, use the `secrets` field to reference AWS Secrets Manager secrets by ARN instead of passing them as plaintext parameters.
+ **Restrict execution role trust policies.** Use `aws:SourceAccount` and `aws:SourceArn` conditions in your execution role trust policies to prevent the confused deputy problem.
+ **Scope Amazon S3 permissions to specific buckets.** Restrict `s3:GetObject` and `s3:PutObject` permissions to the specific Amazon S3 buckets and prefixes used for model artifacts and benchmark outputs.
+ **Enable Amazon S3 bucket encryption.** Ensure that the Amazon S3 buckets used for model artifacts and benchmark results have server-side encryption enabled.
+ **Use tags for access control.** Apply tags to your AI workload configurations, benchmark jobs, and recommendation jobs. You can use tag-based conditions in IAM policies to control access to specific resources.
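For example, an execution role trust policy that applies the `aws:SourceAccount` and `aws:SourceArn` conditions recommended above might look like the following (shown as a Python dict; the account ID and ARN pattern are placeholders):

```
import json

# Trust policy restricted to SageMaker AI calls that originate from your
# own account, mitigating the confused deputy problem.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"aws:SourceAccount": "111122223333"},
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:sagemaker:us-west-2:111122223333:*"
                },
            },
        }
    ],
}
print(json.dumps(trust_policy, indent=2))
```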

### Detective best practices
<a name="inference-optimization-detective"></a>
+ **Enable AWS CloudTrail.** CloudTrail provides a record of all SageMaker AI API calls made in your account, including inference optimization operations.
+ **Monitor with Amazon CloudWatch.** Use Amazon CloudWatch metrics and alarms to monitor the status and performance of your benchmark and recommendation jobs.
+ **Review IAM Access Analyzer findings.** Use IAM Access Analyzer to identify IAM policies that grant overly broad access to your SageMaker AI resources.
+ **Enable Amazon S3 access logging.** Enable server access logging on Amazon S3 buckets used for model artifacts and benchmark results to track access patterns.