AGENTOPS06-BP02 Evaluate and track ongoing agent performance
Pre-deployment evaluation validates that an agent is ready to ship. Post-deployment evaluation validates that it still works. Without continuous assessment, gradual quality degradation from data drift, model updates, and shifting user patterns goes unnoticed until it is expensive to fix.
Desired outcome:
- Agent performance is continually evaluated against defined quality benchmarks.
- Automated pipelines detect degradation in output quality, reasoning accuracy, and business outcome alignment.
- Teams have clear visibility into performance trends over time and can correlate quality changes with specific configuration, model, or data updates.
- Evaluation results drive prioritized improvement actions and provide objective evidence for stakeholder reporting.
Common anti-patterns:
- Evaluating agent performance only at deployment time without continuous post-deployment assessment, missing gradual degradation from data drift, model updates, or changing user patterns.
- Relying solely on automated metrics without periodic human evaluation, missing quality dimensions that automated metrics can't fully capture (like nuance, appropriateness, and business context alignment).
- Using generic evaluation criteria across all agents without tailoring metrics to each agent's specific use case and business objectives, producing evaluation results that don't reflect actual value.
- Treating evaluation as separate from operations rather than integrating it into the operational workflow, creating evaluation debt that accumulates over time.
Benefits of establishing this best practice:
- Continuous evaluation provides an empirical foundation for evidence-based improvement, identifying which agents need attention and which changes produce measurable gains.
- Performance trend tracking reveals patterns that inform systematic improvement, turning evaluation data into practical insights.
- Multi-dimensional scoring catches quality issues that a single metric would miss.
- Correlating quality shifts with configuration changes shortens root-cause analysis.
Level of risk exposed if this best practice is not established: High
Implementation guidance
Amazon Bedrock AgentCore Evaluations is an evaluation service for continuous assessment. Its on-demand mode runs benchmarks during development, and its online mode samples and evaluates live interactions in production without requiring manual triggers. Thirteen built-in evaluators cover correctness, helpfulness, safety, and tool selection accuracy, with custom evaluators available for business-specific requirements. Amazon Bedrock Evaluations supplements this with model-level assessment, and periodic human evaluation covers the dimensions automated metrics miss.
Evaluation frameworks need multiple dimensions because a single metric misses too much. For example:
- Output quality (relevance, accuracy, coherence) measures whether responses are good.
- Safety (hallucination rate, toxicity, guardrail adherence) measures whether responses are safe.
- Efficiency (task completion rate, tool invocation success) measures whether the agent is economical.
- Business alignment (outcome achievement, user satisfaction, SLA compliance) measures whether the agent delivers value.
Weighting depends on the use case. For instance, a customer-support agent might weight satisfaction more heavily than efficiency, while an internal automation agent might weight efficiency more heavily than relevance. Generic weighting produces generic results.
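A use-case-specific composite score can be sketched as a weighted mean over the four dimensions. This is a minimal illustration, not part of any AWS API: the profile names, the weight values, and the `composite_score` function are assumptions chosen for the example, and per-dimension scores are assumed to be normalized to the range 0 to 1.

```python
DIMENSIONS = ("quality", "safety", "efficiency", "business_alignment")

# Hypothetical weight profiles. Real weights should come from each
# agent's business objectives, not from these illustrative numbers.
WEIGHT_PROFILES = {
    "customer_support": {
        "quality": 0.3, "safety": 0.3, "efficiency": 0.1, "business_alignment": 0.3,
    },
    "internal_automation": {
        "quality": 0.2, "safety": 0.3, "efficiency": 0.4, "business_alignment": 0.1,
    },
}


def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Return the weighted mean of per-dimension scores (each assumed in [0, 1])."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total lets profiles use weights that do not sum to 1.
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


# The same raw scores rank differently under different profiles.
example_scores = {
    "quality": 0.9, "safety": 1.0, "efficiency": 0.5, "business_alignment": 0.8,
}
support_score = composite_score(example_scores, WEIGHT_PROFILES["customer_support"])
automation_score = composite_score(example_scores, WEIGHT_PROFILES["internal_automation"])
```

With these example numbers the weak efficiency score costs the internal automation profile more than the customer-support profile, which is the effect tailored weighting is meant to produce.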
Dashboards that show evaluation scores over time make degradation visible before it becomes an incident. Alert on threshold violations and, separately, on persistent negative trends rather than on single-point dips alone. Trend alerts catch the slow-moving problems that are hardest to diagnose after the fact. Correlate evaluation shifts with configuration and model changes so attribution is fast when a metric moves.
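The two alert types can be sketched as follows. This is an illustrative sketch under stated assumptions: scores are a chronological list of periodic composite scores between 0 and 1, and the `floor`, `window`, and `slope_limit` defaults are placeholder values. The trend check uses a Theil-Sen slope (the median of all pairwise slopes), chosen here because a single outlier does not move it much. It is one possible technique, not a method prescribed by this guidance.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(values: list[float]) -> float:
    """Median of pairwise slopes; robust to a single outlying point."""
    return median(
        (values[j] - values[i]) / (j - i)
        for i, j in combinations(range(len(values)), 2)
    )


def check_alerts(
    scores: list[float],
    floor: float = 0.8,          # assumed minimum acceptable score
    window: int = 7,             # assumed number of recent points to inspect
    slope_limit: float = -0.005, # assumed per-period decline that counts as a trend
) -> list[str]:
    """Return alert names for the latest score and the recent trend."""
    if window < 2:
        raise ValueError("window must be at least 2")
    alerts = []
    # Threshold violation: the most recent score is below the floor.
    if scores and scores[-1] < floor:
        alerts.append("threshold_violation")
    # Persistent negative trend: the robust slope over the window is steep enough.
    if len(scores) >= window and theil_sen_slope(scores[-window:]) <= slope_limit:
        alerts.append("persistent_negative_trend")
    return alerts


single_dip = check_alerts([0.9, 0.9, 0.9, 0.7, 0.9, 0.9, 0.9])
slow_decline = check_alerts([0.92, 0.91, 0.90, 0.89, 0.88, 0.87, 0.86])
```

In this sketch a recovered one-off dip raises nothing, while a steady decline that never crosses the floor still raises a trend alert.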
LLM-as-a-judge patterns can use multiple evaluator prompts, each covering a different quality dimension, to produce a composite score that is more reliable than the score from any single prompt. Periodic human review validates the automated scores and catches their blind spots.
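A multi-prompt judge can be sketched as a loop over dimension-specific prompts whose scores are averaged. Everything named here is an assumption for illustration: the prompt wording, the 1 to 5 scale, and the `call_judge` callable, which stands in for whatever model invocation a team actually uses and is not a real library function.

```python
from statistics import mean
from typing import Callable, Optional

# Hypothetical evaluator prompts, one per quality dimension.
JUDGE_PROMPTS = {
    "relevance": "Rate from 1 to 5 how directly the response addresses the request.",
    "accuracy": "Rate from 1 to 5 how factually correct the response is.",
    "coherence": "Rate from 1 to 5 how clear and logically organized the response is.",
}


def judge_response(
    request: str,
    response: str,
    call_judge: Callable[[str], str],
    prompts: dict[str, str] = JUDGE_PROMPTS,
) -> dict:
    """Score one interaction on each dimension and average the valid scores."""
    dimensions: dict[str, float] = {}
    for name, instruction in prompts.items():
        prompt = (
            f"{instruction}\n\nRequest: {request}\nResponse: {response}\n"
            "Reply with a single integer."
        )
        raw = call_judge(prompt)
        try:
            value = int(raw.strip())
        except ValueError:
            continue  # Unparseable judge output is skipped, not guessed.
        if 1 <= value <= 5:
            dimensions[name] = (value - 1) / 4  # Normalize 1..5 to 0..1.
    composite: Optional[float] = mean(list(dimensions.values())) if dimensions else None
    return {"dimensions": dimensions, "composite": composite}


def stub_judge(prompt: str) -> str:
    """Stand-in judge for demonstration; a real one would call a model."""
    if "addresses the request" in prompt:
        return "5"
    if "factually correct" in prompt:
        return " 3 "
    return "not a number"


result = judge_response("How do I reset my password?", "Open Settings.", stub_judge)
```

Keeping the per-dimension scores alongside the composite makes it easier for human reviewers to see which evaluator prompt disagrees with their own judgment.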
Implementation steps
- Configure Amazon Bedrock AgentCore Evaluations: Use on-demand mode for development benchmarking and online mode for continuous production monitoring.
- Define a multi-dimensional evaluation framework: Apply use-case-specific weighting across quality, safety, efficiency, and business alignment.
- Implement LLM-as-a-judge patterns: Use multiple evaluator prompts and supplement with periodic human evaluation.
- Build evaluation dashboards: Show trends over time with alerting for threshold violations and persistent negative trends.
- Correlate evaluation results with change events: Tag deployments, configuration updates, and model changes so quality shifts can be attributed quickly.
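The correlation step can be sketched as a lookup of tagged change events that precede a detected quality shift. This is a minimal sketch: the event dictionary shape, the `attribute_shift` function, the 24-hour lookback, and the sample timestamps are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def attribute_shift(
    shift_time: datetime,
    change_events: list[dict],
    lookback: timedelta = timedelta(hours=24),  # assumed attribution window
) -> list[dict]:
    """Return change events from the lookback window before the shift, newest first."""
    candidates = [
        event
        for event in change_events
        if timedelta(0) <= shift_time - event["time"] <= lookback
    ]
    return sorted(candidates, key=lambda event: event["time"], reverse=True)


# Hypothetical tagged change events; the times are invented sample data.
events = [
    {"kind": "deployment", "time": datetime(2025, 1, 5, 10, 0)},
    {"kind": "config_update", "time": datetime(2025, 1, 9, 18, 0)},
    {"kind": "model_update", "time": datetime(2025, 1, 10, 9, 0)},
    {"kind": "deployment", "time": datetime(2025, 1, 10, 13, 0)},
]
suspects = attribute_shift(datetime(2025, 1, 10, 12, 0), events)
suspect_kinds = [event["kind"] for event in suspects]
```

Ordering candidates newest first is a starting heuristic only. A change that landed shortly before the shift is a suspect, not a confirmed cause.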