Reliability

When agents reason, plan, and act through large language models, reliability takes on dimensions that traditional infrastructure patterns don't address. LLM decisions are stochastic, multi-agent coordination introduces new failure modes, and memory integrity becomes a first-class concern. An agent that works perfectly in testing may behave unpredictably in production when context windows fill up, models return unexpected outputs, or downstream agents become unavailable. This pillar provides best practices for building agent systems that execute tasks predictably, recover from failures automatically, and maintain partial functionality even under adverse conditions.

Capabilities


