Design principles

In addition to the lens-level design principles, the reliability best practices in this lens are represented by at least one of the following principles:

Decouple agents through durable messaging: Persistence, retry, and dead-letter handling absorb transient failures inside the messaging layer instead of cascading them through synchronous call chains.
Constrain blast radius through atomic responsibilities: Single-task agents with the minimum permissions and clear instructions limit how far any individual failure or misbehavior can propagate.
Recover from the last known good state, not the beginning: Checkpointed workflows, idempotent steps, and graceful degradation let work resume after a fault rather than restart from scratch.
Make multi-agent coordination resilient by design: Arbiter patterns, capability taxonomies, and fallback paths keep collaborative workflows running when individual agents become unavailable or unreliable.
Ground reasoning in verifiable evidence: Retrieval from authoritative sources, explicit citation, and hallucination detection keep agent outputs traceable to real data instead of fabricated content.
Exercise failure paths regularly: Inject faults, run degraded-dependency tests, and rehearse recovery procedures. The first time a failure mode occurs should not be in production.

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Reliability

Predictable agent behavior