AGENTOPS07-BP01 Implement automated response and recovery mechanisms
Agents that recover from failure without human intervention keep service running and give the team the failure data they need to help prevent recurrence. Manual-only recovery scales poorly and can turn routine degradations into incidents.
Desired outcome:
- Agent systems detect and recover from common failure scenarios automatically, maintaining service availability and user experience continuity.
- Automatic cutoffs help prevent cascading failures from propagating across the agent environment.
- Fallback strategies keep agents degrading gracefully rather than failing completely when primary capabilities are unavailable.
- Recovery time objectives are defined, met, and tested regularly.
Common anti-patterns:
- Implementing retry logic without automatic cutoffs, causing agents to repeatedly invoke failing services and amplifying load on degraded systems rather than failing fast.
- Designing fallback strategies that silently degrade quality without notifying users, producing a poor experience where users receive low-quality responses without understanding why.
- Failing to define recovery time objectives for different failure scenarios, making it impossible to assess whether recovery mechanisms meet operational requirements.
- Implementing recovery mechanisms that work in isolation but fail in combination, missing failure scenarios where multiple components degrade simultaneously.
- Never testing recovery procedures under realistic failure conditions, discovering problems only during actual production incidents.
Benefits of establishing this best practice:
- Automated recovery captures detailed failure data that drives systematic improvement of agent resilience, letting teams address root causes rather than repeatedly responding to the same incidents.
- Self-healing capabilities adapt to different failure contexts (transient tool unavailability, complete service outages, and model degradation), with recovery strategies proportional to severity.
- Automatic cutoffs break cascading failures at the source rather than allowing them to propagate.
- Chaos engineering exercises validate that recovery works under realistic conditions, not just in theory.
Level of risk exposed if this best practice is not established: High
Implementation guidance
Establish automatic cutoffs for every external dependency. Store state for the cutoff (healthy, degraded, and open) in a fast data store where every agent can read it. Thresholds depend on each dependency's reliability characteristics. Set an error rate threshold (for example, 50% errors in a 60-second window), a timeout threshold (for example, 5 consecutive timeouts), and a recovery probe interval (for example, attempt recovery every 30 seconds). Emit Amazon CloudWatch
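The cutoff behavior described above can be sketched in Python as follows. This is a minimal, in-memory illustration using the example thresholds from the text, not a definitive implementation. The `Cutoff` class, the `min_calls` guard, and the rule that any failure in the window marks the dependency "degraded" are assumptions made for the example. A real deployment would keep this state in the shared data store rather than in process memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cutoff:
    """Hypothetical per-dependency cutoff using the example thresholds."""
    error_rate_threshold: float = 0.5     # 50% errors ...
    window_seconds: float = 60.0          # ... in a 60-second window
    max_consecutive_timeouts: int = 5     # timeout threshold
    probe_interval_seconds: float = 30.0  # recovery probe interval
    min_calls: int = 10                   # assumption: ignore tiny samples
    state: str = "healthy"                # healthy, degraded, or open
    opened_at: float = 0.0
    consecutive_timeouts: int = 0
    calls: deque = field(default_factory=deque)  # (timestamp, succeeded)

    def allow_request(self, now: float) -> bool:
        """Block calls while open, except one probe per probe interval."""
        if self.state != "open":
            return True
        if now - self.opened_at >= self.probe_interval_seconds:
            self.opened_at = now  # the next probe waits another interval
            return True
        return False

    def record(self, now: float, succeeded: bool, timed_out: bool = False) -> None:
        """Record one call outcome and update the cutoff state."""
        if self.state == "open":
            # The call was a recovery probe: close the cutoff only on success.
            if succeeded:
                self.state = "healthy"
                self.calls.clear()
                self.consecutive_timeouts = 0
            return
        self.calls.append((now, succeeded))
        # Drop outcomes that have aged out of the sliding window.
        while self.calls and now - self.calls[0][0] > self.window_seconds:
            self.calls.popleft()
        self.consecutive_timeouts = self.consecutive_timeouts + 1 if timed_out else 0
        failures = sum(1 for _, ok in self.calls if not ok)
        error_rate = failures / len(self.calls)
        rate_tripped = (len(self.calls) >= self.min_calls
                        and error_rate >= self.error_rate_threshold)
        if rate_tripped or self.consecutive_timeouts >= self.max_consecutive_timeouts:
            self.state, self.opened_at = "open", now
        elif failures:
            self.state = "degraded"
        else:
            self.state = "healthy"
```

Callers would check `allow_request` before invoking the dependency and call `record` afterward. Passing the clock value in explicitly keeps the sketch easy to test.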
Fallback strategies need to be designed for each capability, not copied from a template. Tool failures get fallback chains: alternative tools with equivalent capabilities, then graceful degradation, and then manual escalation. LLM inference failures get model fallback chains that route to alternative models (Claude 3.5 Sonnet to Claude 3 Haiku, for example) when the primary is unavailable. Multi-agent coordination failures get single-agent fallback modes that handle tasks with reduced capability rather than failing completely. Each fallback should notify users when quality is degraded, not silently return a worse answer.
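A fallback chain with explicit degradation notices might look like the following Python sketch. The `FallbackResult` type, the `run_with_fallback` function, and the notice wording are hypothetical names chosen for illustration. The providers are plain callables standing in for tools or models, ordered from primary to last resort.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class FallbackResult:
    value: Optional[str]     # None when every provider failed
    provider: Optional[str]  # name of the provider that answered
    degraded: bool           # True when the primary did not answer
    notice: Optional[str]    # user-facing message about reduced quality


def run_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    request: str,
) -> FallbackResult:
    """Try each provider in order; flag any answer not from the primary."""
    errors = []
    for index, (name, call) in enumerate(providers):
        try:
            value = call(request)
        except Exception as exc:  # broad catch is deliberate in this sketch
            errors.append(f"{name}: {exc}")
            continue
        degraded = index > 0
        notice = (
            f"The primary option was unavailable, so this response came "
            f"from {name} and may be less complete."
            if degraded else None
        )
        return FallbackResult(value, name, degraded, notice)
    # Every provider failed: hand off to manual escalation.
    return FallbackResult(
        None, None, True,
        "Automated handling failed; the request was escalated to an operator.",
    )
```

Returning the notice alongside the value makes it harder for a caller to surface a degraded answer without telling the user.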
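Build durable recovery workflows with AWS Step Functions or an equivalent, using error handling, retry logic, and compensating transactions.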
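As an illustration of a recovery workflow with retries and a compensating step, the Python sketch below builds an Amazon States Language definition as a dictionary and serializes it to JSON. The state names, the retry values, and the Lambda ARNs (with `REGION` and `ACCOUNT_ID` placeholders) are hypothetical; only the `Retry`, `Catch`, `Task`, and `Fail` constructs come from the States Language itself.

```python
import json

# Hypothetical workflow: invoke a tool, retry transient errors with backoff,
# and run a compensating step if the retries are exhausted.
recovery_workflow = {
    "Comment": "Sketch of a recovery workflow with retry and compensation",
    "StartAt": "InvokeTool",
    "States": {
        "InvokeTool": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:invoke-tool",
            "Retry": [
                {
                    "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",  # keep the error for diagnosis
                    "Next": "UndoPartialWork",
                }
            ],
            "End": True,
        },
        "UndoPartialWork": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:undo-partial-work",
            "Next": "RecoveryFailed",
        },
        "RecoveryFailed": {
            "Type": "Fail",
            "Error": "RecoveryFailed",
            "Cause": "Retries exhausted; compensation ran and escalation is required",
        },
    },
}

definition_json = json.dumps(recovery_workflow, indent=2)
```

Ending in a `Fail` state after compensation keeps the execution marked as failed, so the incident stays visible in execution history.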
Monitoring actual recovery times against objectives tells the team whether the mechanisms meet operational requirements. Quarterly chaos engineering exercises validate that recovery works under realistic conditions rather than just in the happy-path scenarios the original design anticipated. For reliability-focused automated recovery with classify-route-escalate patterns, see AGENTREL07-BP02 Enable automatic recovery from agent execution failures.
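Comparing measured recovery times with objectives can be as simple as the following Python sketch. The `RecoveryEvent` record, the scenario names, and the RTO values in the usage example are hypothetical. Publishing the durations as custom metrics is left out so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RecoveryEvent:
    scenario: str        # for example "tool_unavailable"
    detected_at: float   # epoch seconds when the failure was detected
    recovered_at: float  # epoch seconds when service was restored


def rto_breaches(
    events: Iterable[RecoveryEvent],
    rto_seconds: Dict[str, float],
) -> List[Tuple[str, float, float]]:
    """Return (scenario, actual_seconds, objective_seconds) for each miss."""
    breaches = []
    for event in events:
        duration = event.recovered_at - event.detected_at
        objective = rto_seconds[event.scenario]
        if duration > objective:
            breaches.append((event.scenario, duration, objective))
    return breaches


# Usage with illustrative objectives: 30 s for tool failures, 120 s for model outages.
objectives = {"tool_unavailable": 30.0, "model_outage": 120.0}
observed = [
    RecoveryEvent("tool_unavailable", 1000.0, 1012.0),  # 12 s, within RTO
    RecoveryEvent("model_outage", 2000.0, 2300.0),      # 300 s, misses RTO
]
missed = rto_breaches(observed, objectives)
```

A report like `missed` gives chaos exercises a concrete pass or fail signal per failure scenario.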
Implementation steps
- Implement automatic cutoffs for every external dependency: Define thresholds for error rate, timeout count, and recovery probing per dependency.
- Design fallback chains per agent capability: Specify alternative tools, models, and degraded-mode operations, and notify users when quality is degraded.
- Build durable recovery workflows: Use AWS Step Functions or equivalent with error handling, retry logic, and compensating transactions.
- Configure health check endpoints: Verify dependency availability and report overall health status for each agent.
- Define RTOs per failure scenario: Monitor actual recovery times against objectives in Amazon CloudWatch.
- Run quarterly chaos engineering exercises: Inject failures in non-production environments to validate recovery mechanisms under realistic conditions.
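The health check step above can be illustrated with a small aggregation function. This Python sketch assumes each dependency exposes a probe callable that returns `True` when reachable. The `agent_health` name, the status labels, and the rule for combining them are illustrative choices, and the resulting dictionary would be served by whatever HTTP framework the agent already uses.

```python
from typing import Callable, Dict


def agent_health(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Probe each dependency and summarize overall agent health."""
    dependencies = {}
    for name, check in checks.items():
        try:
            dependencies[name] = "ok" if check() else "unavailable"
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            dependencies[name] = "unavailable"
    failed = [name for name, status in dependencies.items() if status != "ok"]
    if not failed:
        status = "healthy"
    elif len(failed) < len(dependencies):
        status = "degraded"   # some dependencies are still usable
    else:
        status = "unhealthy"  # nothing the agent depends on is reachable
    return {"status": status, "dependencies": dependencies}
```

Reporting per-dependency status next to the overall status lets an operator see which fallback path is likely in use.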
Resources
Related best practices:
Related documents:
Related services: