Capability intent Maturity levels Common issues to watch for

Agent monitoring, management and recovery

Agent systems that decompose workflows into recoverable stages, classify failures for targeted retry, and implement end-to-end distributed tracing recover smoothly from the failures that occur across their many components.

AGENTREL07: How do fault tolerant agent systems recover?

Capability intent

Agent workflows are decomposed into stages with persisted outputs and explicit validation between them, so failures are contained to the affected stage rather than cascading through the whole workflow.
Failures are classified before any recovery action is taken, so retries apply to transient errors, fallbacks apply to persistent ones, and only genuinely unrecoverable failures reach human attention.
Retries use exponential backoff with jitter and a retry budget, so widespread upstream failures don't produce unbounded retry storms.
Self-healing responses address common failure patterns directly (prompt refinement, tool substitution, context reconstruction) before escalating to fallbacks.
End-to-end distributed tracing covers every agent invocation, with agent-specific annotations and framework-level reasoning steps captured in the same trace view.
Traces, metrics, and logs are correlated in unified dashboards, so operators diagnose incidents from evidence rather than by pivoting between tools.

Maturity levels

These levels summarize what each stage of maturity looks like for agent monitoring, management and recovery as a whole.

Level	Name	What it looks like
1	Initial	Agent workflows run as monolithic processes, so any failure causes a complete restart and lost work. Retries are applied uniformly, including to non-retryable errors, and without exponential backoff or jitter. Tracing is limited to the application boundary, and incident responders reconstruct the execution path from logs. Recovery actions are manual.
2	Emerging	A few long-running workflows are split into stages, and retries use exponential backoff for a limited set of known-transient errors. Amazon CloudWatch alarms notify operators on elevated error rates. Tracing is added for critical paths but doesn't propagate across queue boundaries. Failure classification exists in one-time code rather than a shared library.
3	Defined	Workflows are orchestrated in AWS Step Functions with state persistence at every transition and redrive enabled for incremental recovery. Failure classification is standardized in a shared library, and retries use exponential backoff with full jitter. Amazon Bedrock AgentCore Observability captures end-to-end traces across agent components, and Amazon Bedrock AgentCore Policy enforces retry budgets. Stage-level validation helps prevent errors from propagating through later stages.
4	Proactive	Self-healing responses address common failure patterns directly (prompt refinement, tool substitution, context reconstruction) before fallbacks engage. Fallback strategies route persistent failures to degraded responses, cached answers, or human review queues rather than returning errors. Distributed tracing includes agent-specific annotations and Strands Agents reasoning steps, and trace context propagates through every communication path including asynchronous messages. Amazon Bedrock AgentCore Evaluations validate that recovered outputs match pre-failure quality baselines.
5	Optimized	Recovery strategies are continuously refined based on recovery metrics and trace analysis. Fallback paths are tested alongside primary implementations, and stage decomposition is driven by observed failure and timeout patterns. Trace-based alerting fires on anomalies before they become incidents, and correlated dashboards turn traces, metrics, and logs into a single investigative surface. The organization publishes internal recovery patterns and shares lessons across teams.

Common issues to watch for

Long-running workflows run as monolithic processes, so a failure late in the workflow loses all progress and forces complete restart.
Retry logic is applied uniformly to every failure, retrying non-retryable errors that will never succeed and delaying the human intervention that would resolve them.
Retries run at fixed intervals without exponential backoff or jitter, producing thundering-herd effects that amplify incidents rather than recovering from them.
Only retry-based recovery is implemented, so persistent failures after retries become user-visible errors instead of degrading through a fallback.
Tracing stops at the application boundary or at queue boundaries, so debugging failures in multi-agent systems depends on manual log correlation across services.
Traces are captured without agent-specific annotations, so trace queries can't be filtered down to the specific agent, task type, or model that failed.

Best practices

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

AGENTREL06-BP05 Implement dynamic capability toggling

AGENTREL07-BP01 Design workflows in stages with incremental recovery