Ground truth evaluations
Ground truth is the known correct answer or expected behavior for a given input — the "gold standard" you compare actual results against. For agent evaluation, ground truth transforms subjective quality assessment into objective measurement, enabling regression detection, benchmark datasets, and domain-specific correctness that generic evaluators cannot provide on their own.
With ground truth evaluations, you provide reference inputs alongside your session spans when calling the Evaluate API. The service uses these reference inputs to score your agent’s actual behavior against the expected behavior. Evaluators that don’t use a particular ground truth field ignore it, and the response reports which fields went unused.
Supported built-in evaluators and ground truth fields
The following table shows which built-in evaluators support ground truth and which fields they use.
| Evaluator | Level | Ground truth field | Description |
|---|---|---|---|
| `Builtin.Correctness` | Trace | `expectedResponse` | Measures how accurately the agent’s response matches the expected answer. Uses LLM-as-a-Judge scoring. |
| `Builtin.GoalSuccessRate` | Session | `assertions` | Validates whether the agent’s behavior satisfies natural language assertions across the entire session. Uses LLM-as-a-Judge scoring. |
| Trajectory (exact match) | Session | Expected trajectory | Checks that the actual tool call sequence matches the expected sequence exactly — same tools, same order, no extras. Programmatic scoring (no LLM calls). |
| Trajectory (in-order match) | Session | Expected trajectory | Checks that all expected tools appear in order within the actual sequence, but allows extra tools between them. Programmatic scoring. |
| Trajectory (any-order match) | Session | Expected trajectory | Checks that all expected tools are present in the actual sequence, regardless of order. Extra tools are allowed. Programmatic scoring. |
Note
Custom evaluators also support ground truth fields through placeholders in their evaluation instructions. See Ground truth in custom evaluators for details.
The following table describes the ground truth fields.
| Field | Type | Scope | Description |
|---|---|---|---|
| `expectedResponse` | String | Trace | The expected agent response for a specific turn. Scoped to a specific trace by its trace ID. |
| `assertions` | List of strings | Session | Natural language statements that should be true about the agent’s behavior across the session. |
| Expected trajectory | List of tool names | Session | The expected sequence of tool calls for the session. |
- Ground truth fields are optional. If you omit them, evaluators fall back to their ground truth-free mode (for example, `Builtin.Correctness` still works without `expectedResponse`; it just evaluates based on context alone).
- You can provide all ground truth fields in a single request. The service picks the relevant fields for each evaluator and reports `ignoredReferenceInputFields` in the response for any fields that were not used.
- You don’t need to provide `expectedResponse` for every trace. Traces without ground truth are evaluated using the ground truth-free variant of the evaluator.
Prerequisites
- Python 3.10+
- An agent deployed on AgentCore Runtime with observability enabled, or an agent built with a supported framework configured with AgentCore Observability. Supported frameworks:
  - Strands Agents
  - LangGraph with `opentelemetry-instrumentation-langchain` or `openinference-instrumentation-langchain`
- Transaction Search enabled in CloudWatch — see Enable Transaction Search
- AWS credentials configured with permissions for `bedrock-agentcore`, `bedrock-agentcore-control`, and `logs` (CloudWatch)
For instructions on downloading session spans, see Getting started with on-demand evaluation.
About the examples
The examples on this page use the sample agent from the AgentCore Evaluations tutorials, which has two tools — `calculator` and `weather` — and is deployed on AgentCore Runtime with observability enabled.
The examples assume a two-turn session:
- Turn 1: "What is 15 + 27?" — the agent uses the `calculator` tool and responds with the result.
- Turn 2: "What’s the weather?" — the agent uses the `weather` tool and responds with the current weather.
Before running evaluations, invoke your agent and wait 2–5 minutes for CloudWatch to ingest the telemetry data.
The following constants are used throughout the examples on this page. Replace them with your own values:
```python
REGION = "<region-code>"
AGENT_ID = "my-agent-id"
SESSION_ID = "my-session-id"
TRACE_ID_1 = "<trace-id-1>"  # Turn 1: "What is 15 + 27?"
TRACE_ID_2 = "<trace-id-2>"  # Turn 2: "What's the weather?"
```
Correctness with expected response
`Builtin.Correctness` is a trace-level evaluator that measures how accurately the agent’s response matches an expected answer. When you provide `expectedResponse`, the evaluator compares the agent’s actual response against your ground truth using LLM-as-a-Judge scoring.
Example
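This page doesn’t show the full request schema, so the sketch below only assembles the per-trace ground truth and prints it. The evaluator ID `Builtin.Correctness` and the `expectedResponse` field name come from this page; the surrounding payload keys (`traces`, `traceId`) and the sample answer text are assumptions for illustration, not a verbatim API schema.

```python
# Hedged sketch: per-trace ground truth for Builtin.Correctness.
# Only the evaluator ID and the expectedResponse field name come from
# this page; the payload layout and keys below are assumptions.
TRACE_ID_1 = "<trace-id-1>"  # Turn 1: "What is 15 + 27?"

reference_inputs = {
    "traces": [
        {
            "traceId": TRACE_ID_1,               # assumed key for trace scoping
            "expectedResponse": "15 + 27 = 42",  # expected answer for turn 1
        }
    ]
}
evaluators = ["Builtin.Correctness"]

# An Evaluate API call would pass `evaluators` and `reference_inputs`
# alongside the downloaded session spans (client call omitted here).
print(reference_inputs)
```

Because `expectedResponse` is scoped to a trace, you only include entries for the traces that actually have ground truth; the rest fall back to the ground truth-free variant.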
GoalSuccessRate with assertions
Builtin.GoalSuccessRate is a session-level evaluator that validates whether the agent’s behavior satisfies a set of natural language assertions. Assertions can check tool usage, response content, ordering of actions, or any other observable behavior across the entire conversation.
Note
The examples below use assertions that validate tool usage, but assertions are free-form natural language — you can use them to assert on any aspect of agent behavior, such as response tone, factual accuracy, safety compliance, or business logic.
Example
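As a sketch under the same caveats (the enclosing payload key mirrors the field name described on this page, but the exact request shape is an assumption), session-level assertions for the two-turn sample session might look like:

```python
# Hedged sketch: natural language assertions for Builtin.GoalSuccessRate.
# The assertion texts are illustrative; the "assertions" key mirrors the
# ground truth field described on this page.
assertions = [
    "The agent calls the calculator tool to answer the arithmetic question.",
    "The agent calls the weather tool to answer the weather question.",
    "The agent answers both questions without asking for clarification.",
]
reference_inputs = {"assertions": assertions}
print(reference_inputs)
```

Each assertion is judged across the entire session, so a single list covers both turns.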
Trajectory matching with expected trajectory
The trajectory evaluators compare the agent’s actual tool call sequence against an expected sequence of tool names. Three variants are available, each with different matching strictness. All three are session-level evaluators and use programmatic scoring (no LLM calls, so token usage is zero).
| Evaluator | Matching rule |
|---|---|
| Trajectory (exact match) | Actual must match expected exactly — same tools, same order, no extras |
| Trajectory (in-order match) | Expected tools must appear in order, but extra tools are allowed between them |
| Trajectory (any-order match) | All expected tools must be present, order doesn’t matter, extras allowed |
Example
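A minimal sketch of an expected trajectory for the sample session. The tool names come from this page; the `expectedTrajectory` key is an assumption inferred from the `{expected_tool_trajectory}` custom-evaluator placeholder described later on this page, not a confirmed API field name.

```python
# Hedged sketch: expected tool-call sequence for the two-turn session.
# "expectedTrajectory" is an assumed field name, inferred from the
# {expected_tool_trajectory} custom-evaluator placeholder on this page.
expected_trajectory = ["calculator", "weather"]
reference_inputs = {"expectedTrajectory": expected_trajectory}

# How the three variants would treat some actual sequences:
#   exact match     -> only ["calculator", "weather"] passes
#   in-order match  -> ["calculator", "search", "weather"] also passes
#   any-order match -> ["weather", "calculator"] also passes
print(reference_inputs)
```

One expected trajectory serves all three variants; only the matching strictness differs.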
Combining all ground truth fields in one request
You can pass all ground truth fields together in a single evaluation call. The service routes each field to the appropriate evaluator and ignores fields that a given evaluator doesn’t use. This means you can construct your reference inputs once and reuse them across different evaluators without modifying the payload.
Example
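Putting the pieces together, a combined payload might look like the sketch below. Only `expectedResponse` and `assertions` are named on this page; the `traces` wrapper, `traceId` key, `expectedTrajectory` name, and the sample ground truth values are assumptions for illustration.

```python
# Hedged sketch: all ground truth fields in one reference-input payload.
TRACE_ID_1 = "<trace-id-1>"  # Turn 1: "What is 15 + 27?"
TRACE_ID_2 = "<trace-id-2>"  # Turn 2: "What's the weather?"

reference_inputs = {
    # Trace-level ground truth (used by Builtin.Correctness)
    "traces": [
        {"traceId": TRACE_ID_1, "expectedResponse": "15 + 27 = 42"},
        {"traceId": TRACE_ID_2, "expectedResponse": "A short weather report."},
    ],
    # Session-level ground truth (used by Builtin.GoalSuccessRate)
    "assertions": [
        "The agent uses the calculator tool before answering turn 1.",
        "The agent uses the weather tool before answering turn 2.",
    ],
    # Session-level ground truth (used by the trajectory evaluators)
    "expectedTrajectory": ["calculator", "weather"],
}

# Each evaluator picks only the fields it understands; anything else is
# reported back in ignoredReferenceInputFields rather than raising an error.
print(sorted(reference_inputs))
```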
Understanding ignored reference input fields
When you provide ground truth fields that an evaluator doesn’t use, the response includes an ignoredReferenceInputFields array listing the unused fields. This is informational, not an error — the evaluation still completes successfully.
For example, if you call Builtin.Helpfulness with expectedResponse provided, the evaluator ignores the ground truth (Helpfulness doesn’t use it) and returns:
```json
{
  "evaluatorId": "Builtin.Helpfulness",
  "value": 0.83,
  "label": "Very Helpful",
  "explanation": "...",
  "ignoredReferenceInputFields": ["expectedResponse"]
}
```
This behavior is by design — it allows you to construct a single set of reference inputs and use them across multiple evaluators without adjusting the payload for each one.
Ground truth in custom evaluators
Custom evaluators can use ground truth fields through placeholders in their evaluation instructions. When you create a custom evaluator, you can reference the following placeholders:
- Session-level custom evaluators: `{context}`, `{available_tools}`, `{actual_tool_trajectory}`, `{expected_tool_trajectory}`, `{assertions}`
- Trace-level custom evaluators: `{context}`, `{assistant_turn}`, `{expected_response}`
For example, a custom trace-level evaluator that checks response similarity might use:
```
Compare the agent's response with the expected response.

Agent response: {assistant_turn}
Expected response: {expected_response}

Rate how closely the agent's response matches the expected response on a scale of 0 to 1.
```
When this evaluator is called with expectedResponse in the reference inputs, the service substitutes the placeholder with the actual ground truth value before scoring.
For details on creating custom evaluators, see Custom evaluators.
Note
Custom evaluators that use ground truth placeholders (`{assertions}`, `{expected_response}`, `{expected_tool_trajectory}`) cannot be used in online evaluation configurations, because online evaluations monitor live production traffic where ground truth values are not available.