Evaluate with Inspect AI Container
The SageMaker Inspect AI container runs LLM model evaluations on SageMaker Training Jobs. The container uses Inspect AI
Previous evaluation approaches (based on lighteval
Overview
Key benefits include:
- Bring your own benchmarks: write evaluation tasks in the Inspect AI format, then plug in domain-specific evaluation tasks without depending on a centralized team to onboard them.
- Evaluate with different inference options: works with SageMaker Inference (an existing endpoint or one created on the fly) and Amazon Bedrock, with more inference backends planned.
- Iterate faster: go from benchmark development to production evaluation without infrastructure changes. New benchmark onboarding that previously took days happens in minutes.
- Run at scale: chain multiple benchmarks in one job, and mix standard benchmarks from the inspect-evals library with your own custom tasks in the same job.
- One entry point for all training techniques: whether your model was fine-tuned with SFT (SMTJ, SMHP), CPT (SMHP), or RFT (SMTJ, SMHP), the container evaluates it through the same interface. You can also evaluate mid-training checkpoints saved at specific steps (for example, step 500 or step 1000) by pointing model_s3_uri at the checkpoint path.
How it works
You provide two inputs to the container:
- A YAML configuration file (recipe) that defines the inference provider, benchmarks, and evaluation parameters
- Benchmark files (Python scripts with the @task decorator) uploaded to Amazon S3
The container handles endpoint management, evaluation execution, and result collection. When the training job completes, results are written to your specified S3 output location.
Container image
The following table lists the Inspect AI container image URIs by AWS Region.
| Region | Container image URI |
|---|---|
| us-east-1 | 763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-inspect-ai:latest |
| us-west-2 | 763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-inspect-ai:latest |
| eu-west-2 | 763104351884.dkr.ecr.eu-west-2.amazonaws.com/sagemaker-inspect-ai:latest |
Prerequisites
Before you begin, ensure you have the following resources and access.
| Requirement | Description |
|---|---|
| AWS account with SageMaker access | An active AWS account with permissions to create SageMaker Training Jobs |
| S3 bucket | A bucket to store your evaluation recipes, benchmark files, and output results |
| IAM execution role | A role that SageMaker can assume to access your resources |
| SageMaker inference endpoint or Amazon Bedrock access | A deployed model endpoint or Amazon Bedrock model access for the model you want to evaluate |
| AWS CLI or SageMaker Python SDK | Tools to submit training jobs and manage resources |
| Capacity reservation (large models) | For models that require accelerated instances (such as p5 for Nova Lite 2), contact AWS Support to reserve capacity for SageMaker Inference or Amazon Bedrock |
Step 1: Set up IAM permissions
The SageMaker Training Job runs under an execution role that you provide. This role needs permissions to read benchmarks from S3, write results, invoke the inference endpoint, and write CloudWatch logs.
1.1 Create the trust policy
Save the following as trust_policy.json:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }
1.2 Create the role
    aws iam create-role \
        --role-name InspectLensEvalRole \
        --assume-role-policy-document file://trust_policy.json
1.3 Attach the permissions policy
Save the following as eval_policy.json, replacing all placeholder values (REGION, ACCOUNT_ID, bucket names) with your values:
{ "Version": "2012-10-17", "Statement": [ { "Sid": "S3ReadBenchmarkData", "Effect": "Allow", "Action": ["s3:GetObject", "s3:ListBucket"], "Resource": [ "arn:aws:s3:::YOUR_BENCHMARK_BUCKET", "arn:aws:s3:::YOUR_BENCHMARK_BUCKET/*" ] }, { "Sid": "S3WriteResults", "Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"], "Resource": [ "arn:aws:s3:::YOUR_RESULTS_BUCKET", "arn:aws:s3:::YOUR_RESULTS_BUCKET/*" ] }, { "Sid": "SageMakerInvokeExistingEndpoint", "Effect": "Allow", "Action": [ "sagemaker:InvokeEndpoint", "sagemaker:InvokeEndpointWithResponseStream", "sagemaker:DescribeEndpoint" ], "Resource": "arn:aws:sagemaker:REGION:ACCOUNT_ID:endpoint/*" }, { "Sid": "BedrockInference", "Effect": "Allow", "Action": [ "bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream" ], "Resource": [ "arn:aws:bedrock:REGION::foundation-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:inference-profile/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:provisioned-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:custom-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:imported-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:custom-model-deployment/*" ] }, { "Sid": "BedrockCustomModelAccess", "Effect": "Allow", "Action": [ "bedrock:GetCustomModel", "bedrock:GetImportedModel", "bedrock:GetProvisionedModelThroughput", "bedrock:GetCustomModelDeployment", "bedrock:GetInferenceProfile" ], "Resource": [ "arn:aws:bedrock:REGION:ACCOUNT_ID:custom-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:imported-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:provisioned-model/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:custom-model-deployment/*", "arn:aws:bedrock:REGION:ACCOUNT_ID:inference-profile/*" ] }, { "Sid": "SageMakerCreateEndpointLifecycle", "Effect": "Allow", "Action": [ "sagemaker:CreateModel", "sagemaker:DescribeModel", "sagemaker:DeleteModel", "sagemaker:CreateEndpointConfig", "sagemaker:DescribeEndpointConfig", "sagemaker:DeleteEndpointConfig", "sagemaker:CreateEndpoint", "sagemaker:DescribeEndpoint", 
"sagemaker:DeleteEndpoint", "sagemaker:InvokeEndpoint", "sagemaker:InvokeEndpointWithResponseStream" ], "Resource": [ "arn:aws:sagemaker:REGION:ACCOUNT_ID:model/inspectlens-*", "arn:aws:sagemaker:REGION:ACCOUNT_ID:endpoint/inspectlens-*", "arn:aws:sagemaker:REGION:ACCOUNT_ID:endpoint-config/inspectlens-*" ] }, { "Sid": "ECRAuth", "Effect": "Allow", "Action": "ecr:GetAuthorizationToken", "Resource": "*" }, { "Sid": "PassRoleToSageMaker", "Effect": "Allow", "Action": "iam:PassRole", "Resource": "arn:aws:iam::ACCOUNT_ID:role/InspectLensEvalRole", "Condition": { "StringEquals": { "iam:PassedToService": "sagemaker.amazonaws.com" } } }, { "Sid": "VPCNetworkingForEndpoint", "Effect": "Allow", "Action": [ "ec2:CreateNetworkInterface", "ec2:CreateNetworkInterfacePermission", "ec2:DeleteNetworkInterface", "ec2:DeleteNetworkInterfacePermission", "ec2:DescribeNetworkInterfaces", "ec2:DescribeVpcs", "ec2:DescribeDhcpOptions", "ec2:DescribeSubnets", "ec2:DescribeSecurityGroups" ], "Resource": "*" }, { "Sid": "CloudWatchLogs", "Effect": "Allow", "Action": [ "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents", "logs:DescribeLogStreams" ], "Resource": [ "arn:aws:logs:REGION:ACCOUNT_ID:log-group:/aws/sagemaker/TrainingJobs:*", "arn:aws:logs:REGION:ACCOUNT_ID:log-group:/aws/sagemaker/Endpoints/*" ] }, { "Sid": "MLflowTrackingServer", "Effect": "Allow", "Action": [ "sagemaker:DescribeMlflowTrackingServer", "sagemaker:CreatePresignedMlflowTrackingServerUrl" ], "Resource": "arn:aws:sagemaker:REGION:ACCOUNT_ID:mlflow-tracking-server/*" }, { "Sid": "MLflowTrackingOperations", "Effect": "Allow", "Action": [ "sagemaker-mlflow:AccessUI", "sagemaker-mlflow:CreateExperiment", "sagemaker-mlflow:GetExperiment", "sagemaker-mlflow:GetExperimentByName", "sagemaker-mlflow:SearchExperiments", "sagemaker-mlflow:CreateRun", "sagemaker-mlflow:GetRun", "sagemaker-mlflow:UpdateRun", "sagemaker-mlflow:SearchRuns", "sagemaker-mlflow:LogMetric", "sagemaker-mlflow:LogParam", 
"sagemaker-mlflow:LogBatch", "sagemaker-mlflow:SetTag", "sagemaker-mlflow:LogArtifact", "sagemaker-mlflow:ListArtifacts" ], "Resource": "arn:aws:sagemaker:REGION:ACCOUNT_ID:mlflow-tracking-server/*" }, { "Sid": "KMSForVolumeEncryption", "Effect": "Allow", "Action": [ "kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant", "kms:DescribeKey" ], "Resource": "arn:aws:kms:REGION:ACCOUNT_ID:key/*" } ] }
Attach the policy to the role:
    aws iam put-role-policy \
        --role-name InspectLensEvalRole \
        --policy-name InspectLensEvalPolicy \
        --policy-document file://eval_policy.json
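The placeholder substitution from step 1.3 can be scripted so that no REGION or ACCOUNT_ID token is left behind before the policy is attached. The sketch below is an illustration only: `render_policy` is a hypothetical helper, and `POLICY_TEMPLATE` is a shortened two-statement excerpt of eval_policy.json rather than the full policy.

```python
import json

# Placeholder tokens used in eval_policy.json (step 1.3).
PLACEHOLDERS = ("REGION", "ACCOUNT_ID", "YOUR_BENCHMARK_BUCKET", "YOUR_RESULTS_BUCKET")

# Shortened excerpt of eval_policy.json, used only for this demonstration.
POLICY_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ReadBenchmarkData",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BENCHMARK_BUCKET",
        "arn:aws:s3:::YOUR_BENCHMARK_BUCKET/*"
      ]
    },
    {
      "Sid": "SageMakerInvokeExistingEndpoint",
      "Effect": "Allow",
      "Action": ["sagemaker:InvokeEndpoint", "sagemaker:DescribeEndpoint"],
      "Resource": "arn:aws:sagemaker:REGION:ACCOUNT_ID:endpoint/*"
    }
  ]
}
"""


def render_policy(template_text: str, values: dict[str, str]) -> dict:
    """Substitute placeholders, then fail if any remain or the JSON is invalid."""
    rendered = template_text
    for placeholder, value in values.items():
        rendered = rendered.replace(placeholder, value)
    leftovers = [token for token in PLACEHOLDERS if token in rendered]
    if leftovers:
        raise ValueError(f"Unreplaced placeholders: {leftovers}")
    return json.loads(rendered)


policy = render_policy(
    POLICY_TEMPLATE,
    {
        "REGION": "us-east-1",
        "ACCOUNT_ID": "123456789012",
        "YOUR_BENCHMARK_BUCKET": "my-benchmarks",
    },
)
print(json.dumps(policy, indent=2))
```

To use it with the real file, read eval_policy.json as text, pass it as `template_text`, and write the returned dictionary back with `json.dump` before running the put-role-policy command.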
Step 2: Write your eval recipe
The eval recipe is a YAML configuration file that defines how the container runs your evaluations. The recipe specifies the inference provider, benchmarks, evaluation parameters, and output settings.
For end-to-end examples, see the following notebooks:
Option A: Evaluate an existing SageMaker endpoint
Use this option when you have a model already deployed on a SageMaker inference endpoint.
    inference_provider:
      sagemaker_endpoint:
        endpoint_name: "my-existing-endpoint"
        region: "us-east-1"
    benchmarks:
      s3_path: "s3://your-bucket/benchmarks/my_benchmarks/"
      tasks:
        - name: mmlu
    eval:
      max_connections: 10
      max_retries: 100
      timeout: 600
    output:
      s3_path: "s3://your-bucket/eval-results/"
Option B: Create endpoint, evaluate, then clean up
Use this option to have the container deploy an Amazon Nova base or fine-tuned model, run evaluations, and tear down the endpoint automatically. This is the recommended approach for one-off evaluation runs. Retrieve the latest SageMaker inference container from the Amazon Nova SageMaker Inference container images documentation.
    inference_provider:
      sagemaker_endpoint:
        endpoint_name: null  # null = create new endpoint
        model_s3_uri: "s3://your-bucket/models/nova-micro/"
        inference_image_uri: "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-inference-repo:SM-Inference-latest"
        instance_type: "ml.p5.48xlarge"
        instance_count: 1
        execution_role_arn: "arn:aws:iam::ACCOUNT_ID:role/InspectLensEvalRole"
        region: "us-east-1"
        context_length: "16000"
        max_concurrency: "32"
        cleanup_endpoint: true  # Auto-delete after eval
    benchmarks:
      s3_path: "s3://your-bucket/benchmarks/my_benchmarks/"
      tasks:
        - name: mmlu_pro
        - name: arc_c
        - name: boolq
    eval:
      fail_on_error: false
      decoding:
        temperature: 0.0
        top_p: 1.0
        top_k: -1
        max_tokens: 16000
        reasoning_effort: null
      max_connections: 16
      max_retries: 100
      timeout: 600
      extra_args:
        - "-M"
        - "completion_mode=True"
        - "--logprobs"
        - "--top-logprobs"
        - "5"
    output:
      s3_path: "s3://your-bucket/eval-results/"
Note on model_s3_uri:
- Amazon Nova GA models (base checkpoints): for example, s3://escrow-nova-model-708977205387-us-east-1/nova-lite-2/prod/. SageMaker manages access automatically, so no additional S3 permissions are needed.
- Customized Amazon Nova models (post-training checkpoints): s3://customer-escrow-ACCOUNT_ID-SUFFIX/YOUR_RUN_NAME/outputs/checkpoints/step_N/. This is the escrow bucket path from your training job output.
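When you evaluate several mid-training checkpoints, it can help to build the model_s3_uri values from the pattern above rather than typing them by hand. This is a small sketch that assumes the customized-model path layout shown in the note; `checkpoint_s3_uri` is a hypothetical helper, and the account ID, suffix, and run name in the example are made-up values.

```python
def checkpoint_s3_uri(account_id: str, suffix: str, run_name: str, step: int) -> str:
    """Build the escrow-bucket path for one checkpoint step (layout from the note above)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    return (
        f"s3://customer-escrow-{account_id}-{suffix}/"
        f"{run_name}/outputs/checkpoints/step_{step}/"
    )


# Example: one recipe per checkpoint, each with a different model_s3_uri.
for step in (500, 1000):
    print(checkpoint_s3_uri("123456789012", "abc", "my-run", step))
```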
Option C: Evaluate through Amazon Bedrock
Use this option to evaluate a model available through Amazon Bedrock without managing an endpoint.
    inference_provider:
      bedrock:
        model_id: amazon.nova-pro-v1:0
        region: us-east-1
    benchmarks:
      - name: mmlu
        path: benchmarks/mmlu_pro.py
        limit: 100
    eval:
      fail_on_error: false
      decoding:
        temperature: 0.0
        top_p: 1.0
        top_k: -1
        max_tokens: 8192
        reasoning_effort: null
      max_connections: 10
      max_retries: 50
      timeout: 600
      extra_args:
        - "--display"
        - "plain"
    output:
      s3_path: s3://your-bucket/eval/output/
Benchmarks configuration
The benchmarks section defines which evaluation tasks to run. You can chain multiple benchmarks in a single job.
| Field | Required | Default | Description |
|---|---|---|---|
| name | Yes | - | A descriptive name for the benchmark run |
| path | Yes | - | Relative path to the benchmark Python file in your S3 benchmarks directory |
| limit | No | None (all samples) | Maximum number of samples to evaluate. Use for testing before full runs. |
| epochs | No | 1 | Number of times to repeat the evaluation for statistical significance |
| task_args | No | - | Key-value pairs passed as arguments to the benchmark task function |
Eval configuration
The eval section controls how the container executes evaluations.
| Parameter | Required | Default | Description |
|---|---|---|---|
| fail_on_error | No | false | When true, stops the evaluation if any sample fails. Set to true for strict validation. |
| max_connections | No | 10 | Number of parallel requests to the inference endpoint |
| max_retries | No | 3 | Number of retry attempts for failed inference requests |
| timeout | No | 600 | Request timeout in seconds for each inference call |
| extra_args | No | - | Additional command-line arguments, given as a list of strings, passed directly to the Inspect AI eval command |
Decoding parameters
Configure model decoding parameters within the eval.decoding section:
| Parameter | Required | Default | Description |
|---|---|---|---|
| temperature | No | 0.0 | Controls randomness in generation. Use 0.0 for deterministic, reproducible benchmark results. |
| top_p | No | 1.0 | Nucleus sampling threshold. 1.0 means no restriction. |
| top_k | No | -1 | Limits sampling to the top K most likely tokens. -1 disables this filter. |
| max_tokens | No | 8192 | Maximum number of tokens to generate per response. Increase for benchmarks requiring long reasoning chains. |
| reasoning_effort | No | null | Controls reasoning depth for models that support it (for example, Amazon Nova models with extended thinking). Options: low, high, or null to disable. |
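To see how the documented defaults combine with values set in a recipe, the sketch below merges an eval.decoding mapping with the defaults from the table and rejects unknown keys or unsupported reasoning_effort values. It illustrates the table only and is not the container's own validation code; `resolve_decoding` is a hypothetical name.

```python
# Defaults and allowed values copied from the decoding parameters table.
DECODING_DEFAULTS = {
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "max_tokens": 8192,
    "reasoning_effort": None,  # YAML null
}
ALLOWED_REASONING_EFFORT = {None, "low", "high"}


def resolve_decoding(overrides=None):
    """Return the effective decoding settings for a recipe's eval.decoding section."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(DECODING_DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown decoding parameters: {unknown}")
    resolved = {**DECODING_DEFAULTS, **overrides}
    if resolved["reasoning_effort"] not in ALLOWED_REASONING_EFFORT:
        raise ValueError("reasoning_effort must be 'low', 'high', or null")
    if resolved["max_tokens"] <= 0:
        raise ValueError("max_tokens must be positive")
    return resolved


# A recipe that only raises max_tokens keeps every other default.
print(resolve_decoding({"max_tokens": 16000}))
```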
Output configuration
The output section defines where and how evaluation results are stored.
| Field | Required | Default | Description |
|---|---|---|---|
| s3_path | Yes | - | S3 URI where evaluation results are written |
| output_format | No | eval | Format for result files. See the following table for options. |
The following output formats are available:
| Format | Description |
|---|---|
| eval | Native Inspect AI format. Compatible with inspect view for interactive analysis. |
| csv | Comma-separated values. Suitable for spreadsheet analysis and data pipelines. |
| jsonl | JSON Lines format. One JSON object per line for streaming processing. |
| json | Standard JSON format. Complete results in a single structured file. |
MLflow configuration
Optionally, you can log evaluation metrics to an MLflow tracking server. Add the tracking section to your recipe:
    tracking:
      mlflow_tracking_arn: null               # ARN of SageMaker MLflow tracking server
      mlflow_experiment_name: "inspectlens"   # experiment name
      mlflow_tracing: true                    # log full request/response traces
      mlflow_log_artifacts: true              # upload .eval files to MLflow
| Field | Required | Default | Description |
|---|---|---|---|
| mlflow_tracking_arn | No | null | ARN of your SageMaker MLflow tracking server. Setting this enables MLflow logging. Omit or set to null to disable. |
| mlflow_experiment_name | No | inspectlens | Name of the MLflow experiment to log runs under. |
| mlflow_tracing | No | true | When true, logs full request/response traces for each sample. |
| mlflow_log_artifacts | No | true | When true, uploads .eval log files as MLflow artifacts. |
Full recipe reference
The following example shows a complete recipe with all available configuration options:
    inference_provider:
      sagemaker_endpoint:
        endpoint_name: my-nova-endpoint
        model_s3_uri: s3://your-bucket/models/my-model/
        inference_image_uri: "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-inference-repo:SM-Inference-latest"
        instance_type: ml.g5.12xlarge
        cleanup_endpoint: true
        region: us-west-2
    benchmarks:
      - name: mmlu
        path: benchmarks/mmlu_pro.py
        limit: 100
        epochs: 3
        task_args:
          subject: math
      - name: truthfulqa
        path: benchmarks/truthfulqa.py
        limit: 50
    eval:
      fail_on_error: false
      decoding:
        temperature: 0.0
        top_p: 1.0
        top_k: -1
        max_tokens: 8192
        reasoning_effort: null
      max_connections: 256
      max_retries: 50
      timeout: 600
      extra_args:
        - "--display"
        - "plain"
    output:
      s3_path: s3://your-bucket/eval/output/
      output_format: eval
    tracking:
      mlflow_tracking_arn: "arn:aws:sagemaker:us-west-2:123456789012:mlflow-tracking-server/my-server"
      mlflow_experiment_name: "inspectlens"
      mlflow_tracing: true
      mlflow_log_artifacts: true
Step 3: Prepare benchmark files
Benchmarks are Python files that use the Inspect AI @task decorator to define evaluation tasks. The inspect-evals repository
Example benchmark
The following example shows a minimal benchmark that evaluates multiple-choice performance on a HuggingFace dataset:
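    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample, hf_dataset
    from inspect_ai.scorer import choice
    from inspect_ai.solver import multiple_choice


    def record_to_sample(record):
        # cais/mmlu records use "question", "choices", and an integer "answer",
        # so map them onto the fields that Inspect AI expects.
        return Sample(
            input=record["question"],
            choices=record["choices"],
            target="ABCD"[record["answer"]],
        )


    @task
    def my_benchmark():
        return Task(
            dataset=hf_dataset(
                path="cais/mmlu",
                name="abstract_algebra",
                split="test",
                sample_fields=record_to_sample,
            ),
            solver=multiple_choice(),
            scorer=choice(),
        )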
Dependencies
If your benchmark needs extra dependencies, include a pyproject.toml or requirements.txt in the same S3 directory:
    [project]
    name = "my-benchmark"
    version = "0.1.0"
    requires-python = ">=3.12"
    dependencies = [
        "datasets>=2.14.0",
    ]
Pre-installed packages
The container includes the following packages. You do not need to list these in your pyproject.toml.
| Package | Version |
|---|---|
| Python | 3.12 |
| inspect-ai | 0.3.220 |
| boto3 | 1.40.61 |
| aioboto3 | 15.5.0 |
| openai | 2.36.0 |
| mlflow | 3.12.0 |
| pyyaml | 6.0.3 |
Task selection
The tasks field in your recipe controls which tasks within a benchmark file to run.
| Configuration | Example | Behavior |
|---|---|---|
| Empty tasks | tasks: [] | Runs all tasks defined in the benchmark file |
| Name filter | tasks: ["algebra"] | Runs tasks whose name contains the substring "algebra" |
| limit | limit: 50 | Caps the number of samples evaluated per task |
| epochs | epochs: 3 | Repeats evaluation multiple times to measure variance |
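The name-filter behavior in the table can be pictured with a few lines of Python. This is only an illustration of the documented semantics (an empty list selects everything, otherwise a task runs if its name contains one of the filter strings); it is not the container's implementation, and the case-sensitive matching and the task names below are assumptions.

```python
def select_tasks(available, filters):
    """Mimic the documented rule: an empty filter list selects every task."""
    if not filters:
        return list(available)
    # Keep tasks whose name contains at least one filter substring.
    return [name for name in available if any(f in name for f in filters)]


# Hypothetical task names defined in one benchmark file.
tasks_in_file = ["mmlu_algebra", "mmlu_physics", "algebra_word_problems"]

print(select_tasks(tasks_in_file, []))           # every task
print(select_tasks(tasks_in_file, ["algebra"]))  # only names containing "algebra"
```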
Step 4: Prepare your S3 structure
The following structure is a recommendation for keeping configs, benchmarks, and results organized. You can point the container at any S3 location — the structure itself is not required.
    s3://your-bucket/
    ├── config/
    │   └── inspect_config.yaml      # Your eval recipe
    ├── benchmarks/
    │   └── my_benchmarks/           # Your benchmark Python files
    │       ├── pyproject.toml       # Optional: benchmark dependencies
    │       ├── my_task.py           # @task decorated functions
    │       └── _helpers.py          # Shared utilities
    └── output/                      # Results written here
Upload your files to S3:
    # Upload benchmark files
    aws s3 cp my_benchmarks/ s3://your-bucket/benchmarks/my_benchmarks/ --recursive

    # Upload your eval recipe
    aws s3 cp inspect_config.yaml s3://your-bucket/config/inspect_config.yaml
Step 5: Submit the training job
Submit a SageMaker Training Job to run your evaluation. You can use the AWS CLI or the SageMaker Python SDK.
Option A: AWS CLI
    aws sagemaker create-training-job \
        --training-job-name inspect-eval-$(date +%Y%m%d-%H%M%S) \
        --algorithm-specification \
            TrainingImage=763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-inspect-ai:latest,TrainingInputMode=File \
        --role-arn arn:aws:iam::123456789012:role/InspectLensEvalRole \
        --resource-config \
            InstanceType=ml.m5.large,InstanceCount=1,VolumeSizeInGB=30 \
        --stopping-condition MaxRuntimeInSeconds=86400 \
        --input-data-config '[
            {
                "ChannelName": "config",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://your-bucket/eval/config/",
                        "S3DataDistributionType": "FullyReplicated"
                    }
                }
            }
        ]' \
        --output-data-config S3OutputPath=s3://your-bucket/eval/output/ \
        --region us-east-1
Option B: SageMaker Python SDK V3
    from sagemaker.train import ModelTrainer
    from sagemaker.train.configs import InputData, Compute
    from sagemaker.core.shapes.shapes import StoppingCondition, OutputDataConfig

    # Configure the orchestrator job that runs the Inspect AI container
    trainer = ModelTrainer(
        training_image="763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-inspect-ai:latest",
        compute=Compute(
            instance_type="ml.m5.large",
            instance_count=1,
            volume_size_in_gb=30,
        ),
        output_data_config=OutputDataConfig(
            s3_output_path="s3://your-bucket/eval/output/"
        ),
        role="arn:aws:iam::123456789012:role/InspectLensEvalRole",
        stopping_condition=StoppingCondition(max_runtime_in_seconds=86400),
    )

    # The "config" channel holds the recipe and benchmark files
    trainer.train(
        input_data_config=[
            InputData(
                channel_name="config",
                data_source="s3://your-bucket/eval/config/",
            )
        ]
    )
Key parameters
| Parameter | Value | Description |
|---|---|---|
| TrainingImage | 763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-inspect-ai:latest | The Inspect AI container image |
| InstanceType | ml.m5.large | Orchestrator instance type (not the inference instance) |
| VolumeSizeInGB | 30 | Storage for benchmark data and logs |
| MaxRuntimeInSeconds | 86400 | Maximum job duration (24 hours) |
| ChannelName | config | Input channel containing your recipe and benchmark files |
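If you submit jobs from a script with boto3 rather than the AWS CLI, the same parameters map onto the request dictionary for the SageMaker `create_training_job` API. The sketch below only builds that dictionary so it can be inspected or unit-tested; `build_training_job_request` is a hypothetical helper, and the bucket paths and role ARN are the example values used in this step.

```python
from datetime import datetime, timezone


def build_training_job_request(
    image_uri,
    role_arn,
    config_s3_uri,
    output_s3_uri,
    instance_type="ml.m5.large",
    volume_size_gb=30,
    max_runtime_seconds=86400,
    environment=None,
):
    """Assemble keyword arguments for the SageMaker create_training_job API."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    request = {
        "TrainingJobName": f"inspect-eval-{timestamp}",
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": volume_size_gb,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
        "InputDataConfig": [
            {
                "ChannelName": "config",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": config_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
    }
    if environment:
        # Optional container environment variables, for example HF_TOKEN.
        request["Environment"] = dict(environment)
    return request


request = build_training_job_request(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-inspect-ai:latest",
    role_arn="arn:aws:iam::123456789012:role/InspectLensEvalRole",
    config_s3_uri="s3://your-bucket/eval/config/",
    output_s3_uri="s3://your-bucket/eval/output/",
)
print(request["TrainingJobName"])

# To submit (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("sagemaker", region_name="us-east-1").create_training_job(**request)
```

Keeping the request construction separate from the API call makes it easy to log or review exactly what will be submitted.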
Orchestrator instance type guidance
The training job instance runs the evaluation orchestration logic, not the model inference. Choose an instance type based on your evaluation workload.
| Instance type | Use case | Guidance |
|---|---|---|
| ml.m5.large | Recommended default | Sufficient for most evaluation workloads with moderate parallelism |
| ml.m5.4xlarge | Large benchmark suites | Use when running many benchmarks with high max_connections values |
| ml.c5.large | Cost-sensitive | Lower cost alternative for simple evaluations with low parallelism |
Environment variables
You can pass environment variables to the container for authentication or configuration. Add the --environment parameter to the AWS CLI command:
    aws sagemaker create-training-job \
        ... \
        --environment '{
            "HF_TOKEN": "hf_your_token_here",
            "HF_HUB_DOWNLOAD_TIMEOUT": "300"
        }'
Step 6: Monitor the job
After you submit the training job, you can monitor its progress through the AWS CLI or CloudWatch Logs.
Check job status
    aws sagemaker describe-training-job \
        --training-job-name your-job-name \
        --query "TrainingJobStatus" \
        --output text
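For unattended runs, the status check can be wrapped in a polling loop. The sketch below assumes a boto3 SageMaker client is passed in (anything with a compatible `describe_training_job` method works, which also makes the loop easy to test); `wait_for_training_job` is a hypothetical helper and the 60-second interval is an arbitrary choice.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=60, sleep=time.sleep):
    """Poll describe_training_job until the job reaches a terminal status.

    Returns a (status, failure_reason) tuple; failure_reason is None
    unless the service reported one.
    """
    while True:
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status, description.get("FailureReason")
        sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   status, reason = wait_for_training_job(client, "your-job-name")
```

boto3 also ships built-in waiters for training jobs; the explicit loop is shown here because it exposes the failure reason alongside the final status.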
Stream logs
    aws logs tail /aws/sagemaker/TrainingJobs \
        --log-stream-name-prefix your-job-name \
        --follow
What to expect in logs
The container logs show progress through the following stages:
- Startup: Container initialization and configuration validation
- Benchmark download: Downloading benchmark files and installing dependencies
- Endpoint setup: Creating or connecting to the inference endpoint
- Evaluation: Running benchmark tasks with progress indicators
- Results upload: Writing results to S3 and optionally logging to MLflow
- Cleanup: Deleting temporary endpoints if cleanup_endpoint is set to true
Estimated timelines
| Stage | Estimated duration |
|---|---|
| Container startup | 2–5 minutes |
| Endpoint creation (if applicable) | 15–30 minutes |
| Evaluation | Varies by benchmark size and model latency |
| Cleanup | 1–2 minutes |
Step 7: View and interpret results
After the job completes, view your evaluation results.
View with Inspect AI
Use the Inspect AI viewer to interactively explore results directly from S3:
inspect view --log-dir s3://your-bucket/eval-results/
This command opens a local web interface where you can browse scores, view individual samples, and compare runs.
Download and share
To download results locally:
    aws s3 cp s3://your-bucket/eval/output/ ./results/ --recursive
    INSPECT_LOG_DIR=./results inspect view
VS Code extension
The Inspect AI VS Code extension
- Install the extension from the VS Code marketplace (search "Inspect AI")
- In the Inspect Activity Bar, locate the Logs pane and choose the folder icon
- Enter your S3 path: s3://your-bucket/eval-results/
Output structure
Each evaluation produces a .eval log file that contains the following sections:
- results.scores: Aggregate scores for each metric
- samples: Individual evaluation samples with inputs, outputs, and scores
- stats: Runtime statistics including token usage and latency
- eval.config: The configuration used for the evaluation run
View results in MLflow (optional)
If you configured MLflow in your recipe, generate a presigned URL to access the tracking server:
    aws sagemaker create-presigned-mlflow-tracking-server-url \
        --tracking-server-name my-server \
        --region us-west-2
Open the returned URL in your browser to view metrics, compare runs, and analyze trends across evaluations.
Available benchmarks
The Inspect AI container works with any benchmark written in the Inspect AI task format. The inspect-evals repository
To write your own benchmarks, see the Inspect AI task writing documentation
Agentic evaluations
Agentic benchmarks test a model's ability to complete multi-step tasks that require tool use, planning, and iterative reasoning. These evaluations simulate real-world scenarios where the model must call tools, interpret results, and decide on next actions.
Endpoint requirements
Agentic evaluations require endpoints that support the following capabilities:
- Tool calling: The endpoint must support function calling to enable the model to invoke tools during evaluation
- Large context size: Multi-turn conversations with tool results require sufficient context length to maintain conversation history
SageMaker Inference endpoint configuration
When using a SageMaker Inference endpoint for agentic evaluations, configure the following environment variables on your endpoint:
| Environment variable | Value | Description |
|---|---|---|
| ENABLE_TOOL_CALLING | True | Activates tool calling support on the inference endpoint |
| CONTEXT_LENGTH | Sufficient for multi-turn | Set to a value large enough to accommodate multi-turn conversations with tool results |
For information about setting up Amazon Nova endpoints on SageMaker Inference, see Deploy Amazon Nova models on SageMaker. For information about container features and configuration, see Container features.
Amazon Bedrock endpoints
For Amazon Bedrock endpoints, tool calling is natively supported for compatible models. For more information, see Tool use with Amazon Bedrock.
Getting started with agentic evaluations
To run agentic evaluations, complete the following prerequisites:
- Deploy an endpoint with tool calling enabled
- Choose an agentic benchmark from the inspect-evals repository (look for benchmarks that use tool-calling solvers)
- Configure your recipe with appropriate timeout and max_tokens values for multi-turn interactions
Amazon Bedrock endpoint
- For full setup and deployment, see Amazon Bedrock endpoints.
- For tool calling support, see the client-side tool calling section in Tool use with Amazon Bedrock.
Sample notebooks
The following notebook demonstrates running a tool-calling agentic benchmark with the Inspect AI container:
- tau-bench (job-based): Evaluate tool-augmented reasoning on customer service tasks using the Inspect AI container
For agentic benchmarks that require a Docker sandbox, use the Inspect AI SDK:
- SWE-bench with Inspect AI SDK: Evaluate software engineering capabilities using Docker sandbox
Important
Agentic benchmarks that require a Docker sandbox (such as SWE-bench) are not supported in the Inspect AI container experience. The SageMaker Training Job environment does not provide Docker-in-Docker capabilities. To run these benchmarks, use the Inspect AI SDK on a compute environment with Docker access (for example, an Amazon EC2 instance or SageMaker notebook with Docker installed).
Troubleshooting
This section provides solutions for common issues when running evaluations with the Inspect AI container.
Quick iteration tip
Before submitting a SageMaker Training Job, test your benchmarks locally with the Inspect AI SDK. Run inspect eval my_benchmark.py on your local machine to validate task definitions, dependencies, and scoring logic before running at scale.
InsufficientInstanceCapacity error
This error occurs when AWS does not have enough capacity for the requested instance type in your Region.
- Submit the job in another AWS Region
- Use a different instance type (for example, ml.m5.xlarge instead of ml.m5.large)
- Retry the request after a few minutes
AccessDenied error
If the training job fails with an access denied error, verify the following:
- The role ARN in your job configuration is correct
- The trust policy allows sagemaker.amazonaws.com to assume the role
- Your user or role has the iam:PassRole permission for the execution role
- The execution role has permissions to access the S3 bucket, inference endpoint, or Amazon Bedrock model
Endpoint creation fails
When using cleanup_endpoint: true with automatic endpoint creation, the following issues might occur:
| Error | Solution |
|---|---|
| ResourceLimitExceeded | Request a service quota increase for the inference instance type in your Region |
| OutOfMemoryError | Use a larger inference instance type or reduce model size |
| Wrong model_s3_uri | Verify the S3 path points to a valid model artifact directory |
| Wrong inference image URI | Verify the image URI is correct for your Region and model framework |
| Endpoint stuck in Creating | Check CloudWatch Logs for the endpoint. The model might fail health checks. Increase MaxRuntimeInSeconds if the endpoint needs more time. |
HuggingFace download timeouts
If benchmarks that download datasets from HuggingFace Hub time out, set the HF_HUB_DOWNLOAD_TIMEOUT environment variable to a higher value (in seconds):
--environment '{"HF_HUB_DOWNLOAD_TIMEOUT": "600"}'
Job killed but endpoint still running
If the training job is interrupted before cleanup completes, the inference endpoint might remain active. Manually delete the endpoint to avoid ongoing charges:
    # List endpoints to find the orphaned one
    aws sagemaker list-endpoints \
        --name-contains inspect \
        --query "Endpoints[].EndpointName" \
        --output table

    # Delete the endpoint
    aws sagemaker delete-endpoint \
        --endpoint-name your-endpoint-name

    # Delete the endpoint configuration
    aws sagemaker delete-endpoint-config \
        --endpoint-config-name your-endpoint-name

    # Delete the model
    aws sagemaker delete-model \
        --model-name your-endpoint-name
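The listing step can also be done from Python, which is convenient when an account has more endpoints than fit in one response page. The sketch below assumes a boto3 SageMaker client and follows NextToken pagination; `list_endpoint_names` is a hypothetical helper, and it only lists candidates. Deciding which ones are orphaned and deleting them remains a manual step.

```python
def list_endpoint_names(client, name_contains="inspect"):
    """Collect endpoint names matching a substring, following pagination."""
    names = []
    kwargs = {"NameContains": name_contains}
    while True:
        page = client.list_endpoints(**kwargs)
        names.extend(item["EndpointName"] for item in page.get("Endpoints", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Request the next page of results.
        kwargs["NextToken"] = token


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   for name in list_endpoint_names(client):
#       print(name)
```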
Benchmark dependency conflicts
If a benchmark fails due to dependency conflicts with pre-installed packages, create a pyproject.toml in the benchmark directory with explicit version constraints. The container installs benchmark dependencies in isolation to minimize conflicts.
Eval scores look wrong
If evaluation scores are unexpectedly low or inconsistent, check the following settings in your recipe:
- temperature: Set to 0.0 for deterministic, reproducible results
- max_tokens: Ensure the value is large enough for the model to complete its response
- completion_mode: For base (non-chat) models, set completion_mode: true in your recipe to use completion-style prompting instead of chat format
Data privacy
Your evaluation data is handled differently depending on the inference provider you use.
SageMaker endpoint
When you use a SageMaker Inference endpoint, all data stays within your AWS account. Evaluation prompts and model responses are not sent outside your account and are not used to improve AWS services. No opt-out policy is needed.
Amazon Bedrock
When you use Amazon Bedrock as the inference provider, your data is subject to the AWS AI Services Opt-Out Policy. To prevent your data from being used to improve AWS AI services, enable the opt-out policy at the AWS Organizations level. For more information, see AI services opt-out policies.
| Inference provider | Opt-out required | Details |
|---|---|---|
| SageMaker endpoint | No | Data stays in your account. Not covered by AI opt-out policy. |
| Amazon Bedrock | Yes | Enable the AWS AI Services Opt-Out Policy at the Organizations level to prevent data use for service improvement. |