
Model evaluation

Evaluation runs your agent against a prompt set and reports reward, pass@k, and trajectory metrics. Use it to compare a fine-tuned model against its base or benchmark a candidate before deployment.

Note: Evaluation jobs use JobCategory: AgentRFTEvaluation (not AgentRFT as used in training). Use this category on CreateJob and DescribeJob calls for eval jobs.

Preparing evaluation data

Evaluation datasets use the same format and schema as training datasets. See Prompt dataset format. The guidance below is eval-specific.

Hold out evaluation data from training datasets. Never evaluate on training prompts, because the scores will overstate performance. Reserve a portion of the data as a held-out set, or maintain a separate eval dataset. Keep the same eval set across iterations so results are comparable.
Match the training prompt format. The agent must parse eval prompts the same way it parsed training prompts. If you used encoding or encryption during training, use the identical structure here. Generating both from the same code path avoids drift.
Cover the behaviors you care about. Exercise each tool and tool combination your agent uses. Include prompts that previously caused failures so regressions surface.
Protect sensitive content the same way as in training. The service passes prompts through without inspection.
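
One way to keep the held-out set stable across iterations is to assign each prompt to a split by hashing a stable identifier, so the same prompt always lands in the same split. The sketch below is illustrative only: the `id` field, the `split_prompts` helper, and the 10% fraction are assumptions, not part of the service's dataset schema.

```python
import hashlib


def is_eval_prompt(prompt_id: str, eval_fraction: float = 0.1) -> bool:
    """Deterministically decide whether a prompt belongs to the eval split."""
    digest = hashlib.sha256(prompt_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < eval_fraction


def split_prompts(records, id_key="id", eval_fraction=0.1):
    """Split records into (train, held_out) without any shared prompts."""
    train, held_out = [], []
    for record in records:
        if is_eval_prompt(str(record[id_key]), eval_fraction):
            held_out.append(record)
        else:
            train.append(record)
    return train, held_out


# Hypothetical records; real prompts follow the Prompt dataset format.
records = [{"id": f"prompt-{i}", "prompt": f"task {i}"} for i in range(1000)]
train, held_out = split_prompts(records)
print(len(train), len(held_out))
```

Because the assignment depends only on the identifier, adding new prompts later does not move existing prompts between splits.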

Launching an evaluation job

Create an evaluation job (Bedrock AgentCore)

To create an evaluation job, call the CreateJob API with JobCategory set to AgentRFTEvaluation. Agent setup and configuration follow the same process as for training jobs.

Using AWS CLI



aws sagemaker create-job \
  --job-category AgentRFTEvaluation \
  --job-name "my-agent-rft-eval-job" \
  --role-arn "arn:aws:iam::123456789012:role/SageMakerFineTuningJobRole" \
  --job-config-schema-version "1.0.0" \
  --job-config-document '{
    "AgentConfig": {
      "BedrockAgentCoreConfig": {
        "AgentRuntimeArn": "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent"
      }
    },
    "InputDataConfig": [{
      "ChannelName": "evaluation",
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "S3Prefix",
          "S3Uri": "s3://your-bucket-name/eval-prompts/"
        }
      }
    }],
    "OutputDataConfig": {
      "S3OutputPath": "s3://your-bucket-name/eval-output/",
      "MlflowConfig": {
        "MlflowResourceArn": "arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/my-rft-mlflow-app"
      }
    },
    "EvaluationConfig": {
      "BaseModelArn": "arn:aws:sagemaker:us-west-2:aws:hub-content/SageMakerPublicHub/Model/openai-reasoning-gpt-oss-20b",
      "AcceptEula": true,
      "HyperParameters": {"batch": { "eval_group_size": 1 }, "eval_metrics_config": { "pass_k_values": [1, 2, 4, 8, 16, 32], "success_threshold": 1 }}
    }
  }' \
  --region us-west-2

Using SageMaker AI Python SDK (boto3)



import json
import boto3

sm = boto3.client("sagemaker")

response = sm.create_job(
    JobName="my-agent-rft-eval-job",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerFineTuningJobRole",
    JobCategory="AgentRFTEvaluation",
    JobConfigSchemaVersion="1.0.0",
    JobConfigDocument=json.dumps({
        "AgentConfig": {
            "BedrockAgentCoreConfig": {
                "AgentRuntimeArn": "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent"
            }
        },
        "InputDataConfig": [{
            "ChannelName": "evaluation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://your-bucket-name/eval-prompts/"
                }
            }
        }],
        "OutputDataConfig": {
            "S3OutputPath": "s3://your-bucket-name/eval-output/",
            "MlflowConfig": {
                "MlflowResourceArn": "arn:aws:sagemaker:us-west-2:123456789012:mlflow-app/my-rft-mlflow-app"
            }
        },
        "EvaluationConfig": {
            "BaseModelArn": "arn:aws:sagemaker:us-west-2:aws:hub-content/SageMakerPublicHub/Model/openai-reasoning-gpt-oss-20b",
            "AcceptEula": True,
            "HyperParameters": {"batch": { "eval_group_size": 1 }, "eval_metrics_config": { "pass_k_values": [1, 2, 4, 8, 16, 32], "success_threshold": 1 }}
        }
    })
)

print(f"Eval Job ARN: {response['JobArn']}")

Evaluating a fine-tuned model

To evaluate a model produced by a training job, include ModelPackageConfig with the InputModelPackageArn:



import json
import boto3

sm = boto3.client("sagemaker")

response = sm.create_job(
    JobName="my-agent-rft-eval-finetuned",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerFineTuningJobRole",
    JobCategory="AgentRFTEvaluation",
    JobConfigSchemaVersion="1.0.0",
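    JobConfigDocument=json.dumps({
        # AgentConfig, InputDataConfig, OutputDataConfig, and EvaluationConfig are
        # elided here; fill them in as in the base-model example above.
        "AgentConfig": {...},
        "InputDataConfig": [...],
        "OutputDataConfig": {...},
        # Points the evaluation at the model package produced by the training job.
        "ModelPackageConfig": {
            "InputModelPackageArn": "arn:aws:sagemaker:us-west-2:123456789012:model-package/my-final-models/1"
        },
        "EvaluationConfig": {...}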
    })
)

print(f"Eval Job ARN: {response['JobArn']}")
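
When comparing a fine-tuned model against its base, it helps to generate both job configurations from one code path so that only ModelPackageConfig differs. The following is a minimal sketch under that assumption; `build_eval_config` is a hypothetical helper, not an SDK function, and it only assembles the fields shown in the examples above.

```python
import json


def build_eval_config(agent_config, eval_prompts_uri, output_uri,
                      evaluation_config, mlflow_arn=None,
                      model_package_arn=None):
    """Assemble a JobConfigDocument string for an evaluation job."""
    output_config = {"S3OutputPath": output_uri}
    if mlflow_arn is not None:
        output_config["MlflowConfig"] = {"MlflowResourceArn": mlflow_arn}

    config = {
        "AgentConfig": agent_config,
        "InputDataConfig": [{
            "ChannelName": "evaluation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": eval_prompts_uri,
                }
            },
        }],
        "OutputDataConfig": output_config,
        "EvaluationConfig": evaluation_config,
    }
    # Only the fine-tuned run carries a ModelPackageConfig.
    if model_package_arn is not None:
        config["ModelPackageConfig"] = {
            "InputModelPackageArn": model_package_arn
        }
    return json.dumps(config)


shared = dict(
    agent_config={"BedrockAgentCoreConfig": {
        "AgentRuntimeArn": "arn:aws:bedrock-agentcore:us-west-2:123456789012:runtime/my-agent"
    }},
    eval_prompts_uri="s3://your-bucket-name/eval-prompts/",
    output_uri="s3://your-bucket-name/eval-output/",
    evaluation_config={
        "BaseModelArn": "arn:aws:sagemaker:us-west-2:aws:hub-content/SageMakerPublicHub/Model/openai-reasoning-gpt-oss-20b",
        "AcceptEula": True,
    },
)
base_doc = build_eval_config(**shared)
finetuned_doc = build_eval_config(
    **shared,
    model_package_arn="arn:aws:sagemaker:us-west-2:123456789012:model-package/my-final-models/1",
)
```

Each returned string can be passed as JobConfigDocument in a create_job call like the ones above.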

Create an evaluation job (Custom Agent with Lambda Forwarder)

Use the same approach as Bedrock AgentCore evaluation but specify CustomAgentLambdaConfig in the AgentConfig:



"AgentConfig": {
    "CustomAgentLambdaConfig": {
        "LambdaArn": "arn:aws:lambda:us-west-2:account-id:function:rft-agent-forwarder"
    }
}

Evaluation hyperparameters

Category	Parameter	Type	Default	Description
batch	eval_group_size	integer	1	Rollouts per prompt. Note: To compute pass@k, set eval_group_size >= k.
eval_metrics_config	pass_k_values	array	[1, 2, 4, 8, 16, 32]	List of k values for computing pass@k and pass^k metrics. pass@k = probability that at least 1 of k sampled rollouts succeeds. pass^k = probability that all k rollouts succeed.
eval_metrics_config	success_threshold	float	1	A rollout is "successful" when reward >= success_threshold. This threshold applies to the pass@k, pass^k, and succeeded/failed rollout count metrics.
eval_sampling_params	temperature	float	0	Sampling temperature for evaluation rollouts. Note: raise this value above 0 when you compute pass@k or pass^k with k > 1, so that repeated rollouts of the same prompt are not deterministic.
eval_sampling_params	sampling_top_p	float	1	Nucleus sampling cutoff for evaluation rollouts.
eval_sampling_params	sampling_max_tokens	integer	4096	Maximum tokens the model can generate per turn during evaluation rollouts.
rollout	timeout	float	600	Time (seconds) after which an evaluation rollout is treated as failed and may be retried.
rollout	max_concurrency	integer	96	Maximum number of evaluation rollouts that can execute in parallel.
rollout	max_retries	integer	3	Number of retry attempts for failed evaluation rollouts before marking them as permanently failed.
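
To make the pass@k and pass^k definitions concrete, the sketch below estimates both for a single prompt from n rollouts of which c succeeded, using the standard combinatorial estimators. This is an illustration of the definitions only; the service's exact estimator is not documented here, and `computable_k_values` is a hypothetical helper that shows why eval_group_size must be at least k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least 1 of k rollouts drawn from n succeeds."""
    if k > n:
        raise ValueError("need at least k rollouts per prompt (n >= k)")
    # comb(n - c, k) counts the draws of k rollouts that are all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_power_k(n: int, c: int, k: int) -> float:
    """Probability that all k rollouts drawn from n succeed (pass^k)."""
    if k > n:
        raise ValueError("need at least k rollouts per prompt (n >= k)")
    # comb(c, k) counts the draws of k rollouts that are all successes.
    return comb(c, k) / comb(n, k)


def computable_k_values(pass_k_values, eval_group_size):
    """Keep only the k values that eval_group_size rollouts can support."""
    return [k for k in pass_k_values if k <= eval_group_size]


# With the default eval_group_size of 1, only k = 1 is computable.
print(computable_k_values([1, 2, 4, 8, 16, 32], eval_group_size=1))

# Example: 8 rollouts for one prompt, 2 of them successful.
print(pass_at_k(8, 2, 4), pass_power_k(8, 2, 2))
```

A per-run metric would average these per-prompt values over all prompts.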

Monitoring evaluation



aws sagemaker describe-job \
  --job-name "my-agent-rft-eval-job" \
  --job-category AgentRFTEvaluation \
  --region us-west-2
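
To wait for a job from a script, you can poll the describe call until the job reaches a terminal state. The sketch below keeps the AWS call outside the helper because the response shape is not shown in this guide: the `JobStatus` field and the terminal status names are assumptions, so check them against an actual DescribeJob response before relying on them.

```python
import time

# Assumed terminal status names; verify against a real DescribeJob response.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, get_status, poll_seconds=30.0,
                 timeout_seconds=3600.0, sleep=time.sleep,
                 clock=time.monotonic):
    """Call describe() repeatedly until get_status() reports a terminal state."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status(describe())
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still in status {status!r} at timeout")
        sleep(poll_seconds)


# Offline demonstration with canned responses instead of a real client.
# With boto3, describe might be (assuming it mirrors the CLI call above):
#   lambda: sm.describe_job(JobName="my-agent-rft-eval-job",
#                           JobCategory="AgentRFTEvaluation")
canned = iter([{"JobStatus": "InProgress"}, {"JobStatus": "Completed"}])
final_status = wait_for_job(
    describe=lambda: next(canned),
    get_status=lambda response: response["JobStatus"],  # assumed field name
    poll_seconds=0.0,
)
print(final_status)
```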

Interpreting evaluation results

Open your MLflow App to view logged metrics, reward distributions, and trajectory visualizations for the evaluation run. The tables below explain each metric.

Reward metrics (`eval/reward/`)

Metric	Description
`eval/reward/mean`	Average reward score across all rollouts.
`eval/reward/min`	Minimum reward score across all rollouts.
`eval/reward/max`	Maximum reward score across all rollouts.
`eval/reward/std`	Standard deviation of reward scores across all rollouts.
`eval/reward/zero_frac`	Fraction of rollouts that scored exactly 0 (complete failures).
`eval/reward/pass_at_1`	Probability that at least 1 out of 1 sample per prompt succeeds. This is the primary success metric.
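`eval/reward/pass_power_1`	Probability that all 1 out of 1 samples per prompt succeed (pass^k with k = 1). For k = 1 this equals `eval/reward/pass_at_1`.
`eval/reward/succeeded_rollouts`	Total number of rollouts with reward >= success_threshold.
`eval/reward/failed_rollouts`	Total number of rollouts with reward below success_threshold.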
`eval/reward/num_prompts`	Number of distinct prompts evaluated.
`eval/reward/rollouts_per_prompt`	Number of attempts (samples) generated per prompt.
`eval/reward/success_threshold`	The reward value required to count a rollout as "successful".
`eval/reward/mean_within_groups`	Average reward per prompt group (requires more than one rollout per prompt, that is, eval_group_size > 1).
`eval/reward/std_within_groups`	Standard deviation of reward within each prompt group.
`eval/reward/min_within_groups`	Minimum reward within each prompt group.
`eval/reward/max_within_groups`	Maximum reward within each prompt group.
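
As a rough illustration of how these reward metrics relate to one another, the sketch below recomputes several of them from raw rewards grouped by prompt. It is not the service's implementation: the use of population standard deviation and the averaging of per-group statistics are assumptions made for the example.

```python
from statistics import mean, pstdev


def reward_summary(groups, success_threshold=1.0):
    """Summarize rewards given one list of rollout rewards per prompt."""
    rewards = [r for group in groups for r in group]
    succeeded = sum(1 for r in rewards if r >= success_threshold)
    return {
        "mean": mean(rewards),
        "min": min(rewards),
        "max": max(rewards),
        "std": pstdev(rewards),  # population std: an assumption
        "zero_frac": sum(1 for r in rewards if r == 0) / len(rewards),
        "succeeded_rollouts": succeeded,
        "failed_rollouts": len(rewards) - succeeded,
        "num_prompts": len(groups),
        # Per-group statistics averaged over prompts: an assumption.
        "mean_within_groups": mean([mean(g) for g in groups]),
        "std_within_groups": mean([pstdev(g) for g in groups]),
    }


# Two prompts with two rollouts each (eval_group_size = 2).
summary = reward_summary([[1.0, 0.0], [1.0, 1.0]])
print(summary)
```

With a single rollout per prompt, every within-group standard deviation is 0, which is why the within-group metrics are only informative when each prompt has more than one rollout.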

Token metrics (`eval/tokens/`)

Metric	Description
`eval/tokens/prompt_mean`	Average prompt length in tokens (includes system prompt, tool descriptions, and multi-turn context).
`eval/tokens/response_mean`	Average model response length in tokens per turn.
`eval/tokens/response_min`	Shortest model response in tokens.
`eval/tokens/response_max`	Longest model response in tokens.
`eval/tokens/response_std`	Standard deviation of response lengths. High variance may indicate inconsistent agent behavior.

Turn metrics (`eval/turns/`)

Metric	Description
`eval/turns/mean`	Average number of turns per rollout. High values may indicate the agent is looping or retrying excessively.
`eval/turns/min`	Fewest turns in any rollout.
`eval/turns/max`	Most turns in any rollout. Very high values suggest the agent got stuck without solving the task.

Log probability metrics (`eval/logprob/`)

Metric	Description
`eval/logprob/nz_mean`	Mean log-probability of non-zero (non-padding) tokens. Values close to 0 indicate high model confidence.
`eval/logprob/nz_min`	Lowest log-probability token (least confident prediction).
`eval/logprob/nz_max`	Highest log-probability token (most confident prediction).
`eval/logprob/nz_std`	Standard deviation of non-zero log-probabilities. Low values mean consistently high confidence.
`eval/logprob/zero_frac`	Fraction of tokens with exactly zero log-probability (probability 1.0), typically padding or forced tokens.
`eval/logprob/zero_count`	Total count of zero-logprob tokens.
`eval/logprob/zero_per_group`	Average zero-logprob tokens per prompt group.

Timing metrics (`timing_s/`)

Metric	Description
`timing_s/eval`	Total wall-clock time for the evaluation in seconds. Divide by `eval/reward/num_prompts` for average time per prompt.

How to diagnose common patterns

Pattern	Likely cause
`pass_at_1 = 0`, high `turns/mean`	Agent is looping without solving tasks. Check tool usage and action patterns in trajectories.
`pass_at_1 = 0`, low `tokens/response_mean`	Agent producing very short (possibly empty or malformed) responses. Check prompt format and model compatibility.
High `turns/max` with low `turns/min`	Agent is showing inconsistent behavior across prompts. Some tasks may be much harder or the agent may be failing on specific tool interactions.
High confidence (`logprob/nz_mean` close to 0) but low reward	Model is confidently producing wrong outputs. It may need more training data diversity or reward signal refinement.
`zero_frac = 1.0` in reward	Complete failure. Verify agent deployment, tool connectivity, and that the evaluation dataset format is correct.
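
Some of the patterns in this table can be checked automatically once the metrics are exported from MLflow into a dictionary. The sketch below encodes four of them; the `diagnose` helper and every numeric threshold ("high" turns, "short" responses, "confident" log-probability, "low" reward) are illustrative assumptions that you would tune for your own agent.

```python
def diagnose(metrics, high_turns=20.0, short_response_tokens=16.0,
             confident_logprob=-0.1, low_reward=0.2):
    """Return a list of likely causes for a dict of eval metrics."""
    findings = []

    if metrics.get("eval/reward/zero_frac") == 1.0:
        findings.append(
            "complete failure: verify agent deployment, tool connectivity, "
            "and dataset format")

    if metrics.get("eval/reward/pass_at_1") == 0:
        if metrics.get("eval/turns/mean", 0.0) >= high_turns:
            findings.append(
                "looping: check tool usage and action patterns in trajectories")
        response_mean = metrics.get("eval/tokens/response_mean")
        if response_mean is not None and response_mean <= short_response_tokens:
            findings.append(
                "short responses: check prompt format and model compatibility")

    nz_mean = metrics.get("eval/logprob/nz_mean")
    reward_mean = metrics.get("eval/reward/mean")
    if (nz_mean is not None and reward_mean is not None
            and nz_mean >= confident_logprob and reward_mean <= low_reward):
        findings.append(
            "confidently wrong: consider more diverse training data or a "
            "refined reward signal")

    return findings


example = {
    "eval/reward/pass_at_1": 0.0,
    "eval/reward/zero_frac": 0.6,
    "eval/turns/mean": 42.0,
    "eval/tokens/response_mean": 300.0,
}
for finding in diagnose(example):
    print(finding)
```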

For raw results, review the artifacts written to the S3OutputPath specified in OutputDataConfig.
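
To see which artifacts a job wrote, you can list the objects under the output prefix. The sketch below splits an S3 URI into bucket and prefix and pages through the listing with the S3 `list_objects_v2` paginator. The layout and names of the artifacts are not described in this guide, so the code only prints keys; `list_artifacts` is a hypothetical helper that expects a boto3 S3 client to be passed in.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri: str):
    """Split s3://bucket/prefix into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_artifacts(s3_client, output_uri: str):
    """Return every object key under the job's S3OutputPath."""
    bucket, prefix = parse_s3_uri(output_uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


print(parse_s3_uri("s3://your-bucket-name/eval-output/"))

# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   for key in list_artifacts(boto3.client("s3"),
#                             "s3://your-bucket-name/eval-output/"):
#       print(key)
```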

Limits & quotas for evaluation

To raise the maximum number of concurrent evaluation jobs, request a quota increase through AWS Service Quotas.

Quota	Default
rft-evaluation-job maximum concurrent jobs	1
