# Workload architecture
<a name="workload-architecture"></a>

 A reliable workload starts with upfront design decisions for both software and infrastructure. Your architecture choices will impact your workload behavior across all six Well-Architected pillars. For reliability, there are specific patterns you must follow. 

 The following sections explain best practices to use with these patterns for reliability.

**Topics**
+ [Design your workload service architecture](design-your-workload-service-architecture.md)
+ [Design interactions in a distributed system to prevent failures](design-interactions-in-a-distributed-system-to-prevent-failures.md)
+ [Design interactions in a distributed system to mitigate or withstand failures](design-interactions-in-a-distributed-system-to-mitigate-or-withstand-failures.md)

# Design your workload service architecture
<a name="design-your-workload-service-architecture"></a>

 Build highly scalable and reliable workloads using a service-oriented architecture (SOA) or a microservices architecture. Service-oriented architecture (SOA) is the practice of making software components reusable via service interfaces. Microservices architecture goes further to make components smaller and simpler. 

 Service-oriented architecture (SOA) interfaces use common communication standards so that they can be rapidly incorporated into new workloads. SOA replaced the practice of building monolith architectures, which consisted of interdependent, indivisible units. 

 At AWS, we have always used SOA, but have now embraced building our systems using microservices. While microservices have several attractive qualities, the most important benefit for availability is that microservices are smaller and simpler. They allow you to differentiate the availability required of different services, and thereby focus investments more specifically to the microservices that have the greatest availability needs. For example, to deliver product information pages on Amazon.com (“detail pages”), hundreds of microservices are invoked to build discrete portions of the page. While there are a few services that must be available to provide the price and the product details, the vast majority of content on the page can simply be excluded if the service isn’t available. Even such things as photos and reviews are not required to provide an experience where a customer can buy a product. 
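
The detail page example can be sketched in a few lines of Python. This is a minimal illustration only, assuming hypothetical section names and fetch functions (`build_detail_page`, `price`, `reviews`, and so on are not part of any AWS API): required sections fail the whole request, while optional sections are dropped when their service is unavailable.

```python
from typing import Callable, Dict


def build_detail_page(
    required: Dict[str, Callable[[], str]],
    optional: Dict[str, Callable[[], str]],
) -> Dict[str, str]:
    """Assemble a page from required and optional sections (hypothetical helper)."""
    page: Dict[str, str] = {}
    # Required sections: any failure propagates to the caller.
    for name, fetch in required.items():
        page[name] = fetch()
    # Optional sections: a failure removes only that section from the page.
    for name, fetch in optional.items():
        try:
            page[name] = fetch()
        except Exception:
            continue
    return page


def _reviews_unavailable() -> str:
    # Simulates an optional microservice that is currently down.
    raise ConnectionError("reviews service unavailable")


page = build_detail_page(
    required={"price": lambda: "19.99", "details": lambda: "Blue mug"},
    optional={"reviews": _reviews_unavailable, "photos": lambda: "mug.jpg"},
)
print(page)
```

The page is still returned without the `reviews` section, which is the behavior the example describes: the customer can still buy the product.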

**Topics**
+ [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md)
+ [REL03-BP02 Build services focused on specific business domains and functionality](rel_service_architecture_business_domains.md)
+ [REL03-BP03 Provide service contracts per API](rel_service_architecture_api_contracts.md)

# REL03-BP01 Choose how to segment your workload
<a name="rel_service_architecture_monolith_soa_microservice"></a>

 Workload segmentation is important when determining the resilience requirements of your application. Monolithic architecture should be avoided whenever possible. Instead, carefully consider which application components can be broken out into microservices. Depending on your application requirements, this may end up being a combination of a service-oriented architecture (SOA) with microservices where possible. Workloads that can be made stateless are better suited to deployment as microservices. 

 **Desired outcome:** Workloads should be supportable, scalable, and as loosely coupled as possible. 

 When making choices about how to segment your workload, balance the benefits against the complexities. What is right for a new product racing to first launch differs from what is right for a workload built to scale from the start. When refactoring an existing monolith, you will need to consider how well the application will support a decomposition towards statelessness. Breaking services into smaller pieces allows small, well-defined teams to develop and manage them. However, smaller services can introduce complexities, including potentially increased latency, more complex debugging, and increased operational burden. 

 **Common anti-patterns:** 
+  The [microservice *Death Star*](https://mrtortoise.github.io/architecture/lean/design/patterns/ddd/2018/03/18/deathstar-architecture.html) is a situation in which the atomic components become so highly interdependent that a failure of one results in a much larger failure, making the components as rigid and fragile as a monolith. 

 **Benefits of establishing this best practice:** 
+  More specific segments lead to greater agility, organizational flexibility, and scalability. 
+  Reduced impact of service interruptions. 
+  Application components may have different availability requirements, which can be supported by a more atomic segmentation. 
+  Well-defined responsibilities for teams supporting the workload. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Choose your architecture type based on how you will segment your workload. Choose an SOA or microservices architecture (or in some rare cases, a monolithic architecture). Even if you choose to start with a monolith architecture, you must ensure that it’s modular and can ultimately evolve to SOA or microservices as your product scales with user adoption. SOA and microservices offer progressively smaller segmentation, which is preferred as a modern scalable and reliable architecture, but there are trade-offs to consider, especially when deploying a microservice architecture. 

 One primary trade-off is that you now have a distributed compute architecture, which can make it harder to meet user latency requirements and adds complexity to the debugging and tracing of user interactions. You can use AWS X-Ray to assist you in solving this problem. Another effect to consider is increased operational complexity as the number of applications you manage grows, which requires the deployment of multiple independent components. 

![\[Diagram showing a comparison between monolithic, service-oriented, and microservices architectures\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/monolith-soa-microservices-comparison.png)


## Implementation steps
<a name="implementation-steps"></a>
+  Determine the appropriate architecture to refactor or build your application. SOA and microservices offer progressively smaller segmentation, which is preferred as a modern scalable and reliable architecture. SOA can be a good compromise for achieving smaller segmentation while avoiding some of the complexities of microservices. For more details, see [Microservice Trade-Offs](https://martinfowler.com/articles/microservice-trade-offs.html). 
+  If your workload is amenable to it, and your organization can support it, you should use a microservices architecture to achieve the best agility and reliability. For more details, see [Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/introduction.html). 
+  Consider following the [*Strangler Fig* pattern](https://martinfowler.com/bliki/StranglerFigApplication.html) to refactor a monolith into smaller components. This involves gradually replacing specific application components with new applications and services. [AWS Migration Hub Refactor Spaces](https://docs.aws.amazon.com/migrationhub-refactor-spaces/latest/userguide/what-is-mhub-refactor-spaces.html) acts as the starting point for incremental refactoring. For more details, see [Seamlessly migrate on-premises legacy workloads using a strangler pattern](https://aws.amazon.com/blogs/architecture/seamlessly-migrate-on-premises-legacy-workloads-using-a-strangler-pattern/). 
+  Implementing microservices may require a service discovery mechanism to allow these distributed services to communicate with each other. [AWS App Mesh](https://docs.aws.amazon.com/app-mesh/latest/userguide/what-is-app-mesh.html) can be used with service-oriented architectures to provide reliable discovery and access of services. [AWS Cloud Map](https://aws.amazon.com/cloud-map/) can also be used for dynamic, DNS-based service discovery. 
+  If you’re migrating from a monolith to SOA, [Amazon MQ](https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/welcome.html) can help bridge the gap as a service bus when redesigning legacy applications in the cloud.
+  For existing monoliths with a single, shared database, choose how to reorganize the data into smaller segments. This could be by business unit, access pattern, or data structure. At this point in the refactoring process, you should choose to move forward with a relational or non-relational (NoSQL) type of database. For more details, see [From SQL to NoSQL](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SQLtoNoSQL.html). 
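
The Strangler Fig step above can be illustrated with a small routing facade. This is a sketch under stated assumptions, not the AWS Migration Hub Refactor Spaces API: `StranglerFacade`, its methods, and the path prefixes are hypothetical names chosen for illustration. Requests for migrated path prefixes go to the new service, and everything else still reaches the monolith.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


class StranglerFacade:
    """Routes each request to a migrated service or falls back to the monolith."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Register a new service as the owner of a path prefix."""
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Check the longest prefixes first so the most specific route wins.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        # Anything not yet migrated is still served by the monolith.
        return self._legacy(path)


facade = StranglerFacade(legacy=lambda path: f"monolith:{path}")
facade.migrate("/orders", lambda path: f"orders-service:{path}")

print(facade.handle("/orders/42"))
print(facade.handle("/catalog/7"))
```

Because callers only see the facade, each component can be moved out of the monolith one prefix at a time without changing clients.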

 **Level of effort for the implementation plan:** High 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL03-BP02 Build services focused on specific business domains and functionality](rel_service_architecture_business_domains.md) 

 **Related documents:** 
+  [Amazon API Gateway: Configuring a REST API Using OpenAPI](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-import-api.html) 
+  [What is Service-Oriented Architecture?](https://aws.amazon.com/what-is/service-oriented-architecture/) 
+  [Bounded Context (a central pattern in Domain-Driven Design)](https://martinfowler.com/bliki/BoundedContext.html) 
+  [Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/introduction.html) 
+  [Microservice Trade-Offs](https://martinfowler.com/articles/microservice-trade-offs.html) 
+  [Microservices - a definition of this new architectural term](https://www.martinfowler.com/articles/microservices.html) 
+  [Microservices on AWS](https://aws.amazon.com/microservices/) 
+  [What is AWS App Mesh?](https://docs.aws.amazon.com/app-mesh/latest/userguide/what-is-app-mesh.html) 

 **Related examples:** 
+  [Iterative App Modernization Workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/f2c0706c-7192-495f-853c-fd3341db265a/en-US/intro) 

 **Related videos:** 
+  [Delivering Excellence with Microservices on AWS](https://www.youtube.com/watch?v=otADkIyugzY) 

# REL03-BP02 Build services focused on specific business domains and functionality
<a name="rel_service_architecture_business_domains"></a>

Service-oriented architectures (SOA) define services with well-delineated functions defined by business needs. Microservices use domain models and bounded context to draw service boundaries along business context boundaries. Focusing on business domains and functionality helps teams define independent reliability requirements for their services. Bounded contexts isolate and encapsulate business logic, allowing teams to better reason about how to handle failures.

 **Desired outcome:** Engineers and business stakeholders jointly define bounded contexts and use them to design systems as services that fulfill specific business functions. These teams use established practices like event storming to define requirements. New applications are designed as services with well-defined boundaries and loose coupling. Existing monoliths are decomposed into [bounded contexts](https://martinfowler.com/bliki/BoundedContext.html) and system designs move towards SOA or microservice architectures. When monoliths are refactored, established approaches like bubble contexts and monolith decomposition patterns are applied. 

 Domain-oriented services are executed as one or more processes that don’t share state. They independently respond to fluctuations in demand and handle fault scenarios in light of domain-specific requirements. 

 **Common anti-patterns:** 
+  Teams are formed around specific technical domains like UI and UX, middleware, or database instead of specific business domains. 
+  Applications span domain responsibilities. Services that span bounded contexts can be more difficult to maintain, require larger testing efforts, and require multiple domain teams to participate in software updates. 
+  Domain dependencies, like domain entity libraries, are shared across services such that changes for one service domain require changes to other service domains. 
+  Service contracts and business logic don’t express entities in a common and consistent domain language, resulting in translation layers that complicate systems and increase debugging efforts. 

 **Benefits of establishing this best practice:** Applications are designed as independent services bounded by business domains and use a common business language. Services are independently testable and deployable. Services meet domain specific resiliency requirements for the domain implemented. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Domain-driven design (DDD) is the foundational approach of designing and building software around business domains. It’s helpful to work with an existing framework when building services focused on business domains. When working with existing monolithic applications, you can take advantage of decomposition patterns that provide established techniques to modernize applications into services. 

![\[Flow chart depicting the approach of domain-driven design.\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/domain-driven-decision.png)


## Implementation steps
<a name="implementation-steps"></a>
+  Teams can hold [event storming](https://serverlessland.com/event-driven-architecture/visuals/event-storming) workshops to quickly identify events, commands, aggregates and domains in a lightweight sticky note format. 
+  Once domain entities and functions have been formed in a domain context, you can divide your domain into services using [bounded context](https://martinfowler.com/bliki/BoundedContext.html), where entities that share similar features and attributes are grouped together. With the model divided into contexts, a template for how to draw microservice boundaries emerges. 
  +  For example, the Amazon.com website entities might include package, delivery, schedule, price, discount, and currency. 
  +  Package, delivery, and schedule are grouped into the shipping context, while price, discount, and currency are grouped into the pricing context. 
+  [Decomposing monoliths into microservices](https://docs.aws.amazon.com/prescriptive-guidance/latest/modernization-decomposing-monoliths/welcome.html) outlines patterns for refactoring monoliths into microservices. Using patterns for decomposition by business capability, subdomain, or transaction aligns well with domain-driven approaches. 
+  Tactical techniques such as the [bubble context](https://www.domainlanguage.com/wp-content/uploads/2016/04/GettingStartedWithDDDWhenSurroundedByLegacySystemsV1.pdf) allow you to introduce DDD in existing or legacy applications without up-front rewrites and full commitments to DDD. In a bubble context approach, a small bounded context is established using a service mapping and coordination, or [anti-corruption layer](https://serverlessland.com/event-driven-architecture/visuals/messages-between-bounded-context), which protects the newly defined domain model from external influences. 
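
As a rough illustration of the steps above, the following Python sketch groups the example entities into the shipping and pricing contexts and adds a small anti-corruption layer. The legacy field names (`PKG_NO`, `WT_GRAMS`), the `Package` model, and both helper functions are assumptions made for this example and do not come from any real system.

```python
from dataclasses import dataclass
from typing import Dict

# Bounded contexts from the example above: entities grouped by business context.
BOUNDED_CONTEXTS = {
    "shipping": {"package", "delivery", "schedule"},
    "pricing": {"price", "discount", "currency"},
}


def context_of(entity: str) -> str:
    """Return the bounded context that owns an entity."""
    for context, entities in BOUNDED_CONTEXTS.items():
        if entity in entities:
            return context
    raise KeyError(entity)


@dataclass(frozen=True)
class Package:
    """Entity in the shipping context, expressed in the domain language."""
    package_id: str
    weight_kg: float


def package_from_legacy(row: Dict[str, str]) -> Package:
    """Anti-corruption layer: translate a legacy record into the shipping model.

    The legacy field names are hypothetical. Only this function knows about
    them, so the new domain model stays free of legacy naming and units.
    """
    return Package(
        package_id=row["PKG_NO"].strip(),
        weight_kg=int(row["WT_GRAMS"]) / 1000,
    )


package = package_from_legacy({"PKG_NO": " P-100 ", "WT_GRAMS": "2500"})
print(context_of("discount"), package)
```

Keeping the translation in one place is what lets a bubble context evolve its own model while the surrounding legacy system stays unchanged.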

 After teams have performed domain analysis and defined entities and service contracts, they can take advantage of AWS services to implement their domain-driven design as cloud-based services. 
+  Start your development by defining tests that exercise business rules of your domain. Test-driven development (TDD) and behavior-driven development (BDD) help teams keep services focused on solving business problems. 
+  Select the [AWS services](https://aws.amazon.com/microservices/) that best meet your business domain requirements and [microservice architecture](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/microservices-on-aws.html): 
  +  [AWS Serverless](https://aws.amazon.com/serverless/) allows your team to focus on specific domain logic instead of managing servers and infrastructure. 
  +  [Containers at AWS](https://aws.amazon.com/containers/) simplify the management of your infrastructure, so you can focus on your domain requirements. 
  +  [Purpose built databases](https://aws.amazon.com/products/databases/) help you match your domain requirements to the best fit database type. 
+  [Building hexagonal architectures on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/hexagonal-architectures/welcome.html) outlines a framework to build business logic into services working backwards from a business domain to fulfill functional requirements and then attach integration adapters. Patterns that separate interface details from business logic with AWS services help teams focus on domain functionality and improve software quality. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL03-BP03 Provide service contracts per API](rel_service_architecture_api_contracts.md) 

 **Related documents:** 
+  [AWS Microservices](https://aws.amazon.com/microservices/) 
+  [Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/introduction.html) 
+  [How to break a Monolith into Microservices](https://martinfowler.com/articles/break-monolith-into-microservices.html) 
+  [Getting Started with DDD when Surrounded by Legacy Systems](https://domainlanguage.com/wp-content/uploads/2016/04/GettingStartedWithDDDWhenSurroundedByLegacySystemsV1.pdf) 
+  [Domain-Driven Design: Tackling Complexity in the Heart of Software](https://www.amazon.com/gp/product/0321125215) 
+  [Building hexagonal architectures on AWS](https://docs.aws.amazon.com/prescriptive-guidance/latest/hexagonal-architectures/welcome.html) 
+  [Decomposing monoliths into microservices](https://docs.aws.amazon.com/prescriptive-guidance/latest/modernization-decomposing-monoliths/welcome.html) 
+  [Event Storming](https://serverlessland.com/event-driven-architecture/visuals/event-storming) 
+  [Messages Between Bounded Contexts](https://serverlessland.com/event-driven-architecture/visuals/messages-between-bounded-context) 
+  [Microservices](https://www.martinfowler.com/articles/microservices.html) 
+  [Test-driven development](https://en.wikipedia.org/wiki/Test-driven_development) 
+  [Behavior-driven development](https://en.wikipedia.org/wiki/Behavior-driven_development) 

 **Related examples:** 
+  [Designing Cloud Native Microservices on AWS (from DDD/EventStormingWorkshop)](https://github.com/aws-samples/designing-cloud-native-microservices-on-aws/tree/main) 

 **Related tools:** 
+  [AWS Cloud Databases](https://aws.amazon.com/products/databases/) 
+  [Serverless on AWS](https://aws.amazon.com/serverless/) 
+  [Containers at AWS](https://aws.amazon.com/containers/) 

# REL03-BP03 Provide service contracts per API
<a name="rel_service_architecture_api_contracts"></a>

Service contracts are documented agreements between API producers and consumers defined in a machine-readable API definition. A contract versioning strategy allows consumers to continue using the existing API and migrate their applications to a newer API when they are ready. Producer deployment can happen any time as long as the contract is followed. Service teams can use the technology stack of their choice to satisfy the API contract. 

 **Desired outcome:** Applications built with service-oriented or microservice architectures are able to operate independently while remaining integrated through runtime dependencies. Changes deployed to an API consumer or producer do not interrupt the stability of the overall system when both sides follow a common API contract. Components that communicate over service APIs can perform independent functional releases, upgrades to runtime dependencies, or fail over to a disaster recovery (DR) site with little or no impact to each other. In addition, discrete services are able to scale independently to absorb resource demand without requiring other services to scale in unison. 

 **Common anti-patterns:** 
+  Creating service APIs without strongly typed schemas. This results in APIs that cannot be used to generate API bindings and payloads that can’t be programmatically validated. 
+  Not adopting a versioning strategy, which forces API consumers to update and release or fail when service contracts evolve. 
+  Error messages that leak details of the underlying service implementation rather than describe integration failures in the domain context and language. 
+  Not using API contracts to develop test cases and mock API implementations to allow for independent testing of service components. 

 **Benefits of establishing this best practice:** Distributed systems composed of components that communicate over API service contracts can improve reliability. Developers can catch potential issues early in the development process with type checking during compilation to verify that requests and responses follow the API contract and required fields are present. API contracts provide a clear self-documenting interface for APIs and provide better interoperability between different systems and programming languages. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Once you have identified business domains and determined your workload segmentation, you can develop your service APIs. First, define machine-readable service contracts for APIs, and then implement an API versioning strategy. When you are ready to integrate services over common protocols like REST, GraphQL, or asynchronous events, you can incorporate AWS services into your architecture to integrate your components with strongly-typed API contracts. 

 **AWS services for service API contracts** 

 Incorporate AWS services including [Amazon API Gateway](https://aws.amazon.com/api-gateway/), [AWS AppSync](https://aws.amazon.com/appsync/), and [Amazon EventBridge](https://aws.amazon.com/eventbridge/) into your architecture to use API service contracts in your application. Amazon API Gateway helps you integrate directly with native AWS services and other web services. API Gateway supports the [OpenAPI specification](https://github.com/OAI/OpenAPI-Specification) and versioning. AWS AppSync is a managed [GraphQL](https://graphql.org/) endpoint that you configure by defining a GraphQL schema, which specifies the service interface for queries, mutations, and subscriptions. Amazon EventBridge uses event schemas to define events and generate code bindings for your events. 
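
To make the idea of a strongly typed contract concrete, here is a minimal sketch in plain Python. It does not use API Gateway, AWS AppSync, or EventBridge; `PriceResponse` and `parse_contract` are hypothetical names, and a real contract would normally live in an OpenAPI, GraphQL, or event schema rather than in application code.

```python
from dataclasses import dataclass
from typing import Any, Dict, Type, TypeVar, get_type_hints

T = TypeVar("T")


@dataclass(frozen=True)
class PriceResponse:
    """Hypothetical response contract for a pricing API."""
    product_id: str
    amount_cents: int
    currency: str


def parse_contract(contract: Type[T], payload: Dict[str, Any]) -> T:
    """Validate a payload against a dataclass contract before using it."""
    hints = get_type_hints(contract)
    for name, expected in hints.items():
        if name not in payload:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(payload[name], expected):
            raise TypeError(f"field {name} must be {expected.__name__}")
    # Fields that are not part of the contract are ignored rather than rejected.
    return contract(**{name: payload[name] for name in hints})


response = parse_contract(
    PriceResponse,
    {"product_id": "B00", "amount_cents": 1999, "currency": "USD", "extra": 1},
)
print(response)
```

Validating at the boundary means a payload that breaks the contract fails immediately with a clear error instead of surfacing later as an unrelated bug.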

## Implementation steps
<a name="implementation-steps"></a>
+  First, define a contract for your API. A contract will express the capabilities of an API as well as define strongly typed data objects and fields for the API input and output. 
+  When you configure APIs in API Gateway, you can import and export OpenAPI Specifications for your endpoints. 
  +  [Importing an OpenAPI definition](https://docs.aws.amazon.com/apigateway/latest/developerguide/import-edge-optimized-api.html) simplifies the creation of your API and can be integrated with AWS infrastructure as code tools like the [AWS Serverless Application Model](https://aws.amazon.com/serverless/sam/) and [AWS Cloud Development Kit (AWS CDK)](https://aws.amazon.com/cdk/). 
  +  [Exporting an API definition](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-export-api.html) simplifies integrating with API testing tools and provides service consumers with an integration specification. 
+  You can define and manage GraphQL APIs with AWS AppSync by [defining a GraphQL schema](https://docs.aws.amazon.com/appsync/latest/devguide/designing-your-schema.html) file to generate your contract interface and simplify interaction with complex REST models, multiple database tables or legacy services. 
+  [AWS Amplify](https://aws.amazon.com/amplify/) projects that are integrated with AWS AppSync generate strongly typed JavaScript query files for use in your application as well as an AWS AppSync GraphQL client library for [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) tables. 
+  When you consume service events from Amazon EventBridge, events adhere to schemas that already exist in the schema registry or that you define with the OpenAPI Spec. With a schema defined in the registry, you can also generate client bindings from the schema contract to integrate your code with events. 
+  Extend or version your API. Extending an API is the simpler option when the new fields can be added as optional fields or as required fields with default values. 
  +  JSON based contracts for protocols like REST and GraphQL can be a good fit for contract extension. 
  +  XML based contracts for protocols like SOAP should be tested with service consumers to determine the feasibility of contract extension. 
+  When versioning an API, consider implementing proxy versioning where a facade is used to support versions so that logic can be maintained in a single codebase. 
  +  With API Gateway you can use [request and response mappings](https://docs.aws.amazon.com/apigateway/latest/developerguide/request-response-data-mappings.html#transforming-request-response-body) to simplify absorbing contract changes by establishing a facade to provide default values for new fields or to strip removed fields from a request or response. With this approach the underlying service can maintain a single codebase. 
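
The facade approach to versioning can be sketched as follows. This is plain Python for illustration, not an API Gateway mapping template, and the contract change it models (version 2 adds `currency` and removes `legacy_code`) is a hypothetical example.

```python
from typing import Any, Dict

# Hypothetical contract change: v2 adds "currency" and removes "legacy_code".
V2_DEFAULTS: Dict[str, Any] = {"currency": "USD"}
V2_REMOVED = {"legacy_code"}


def upgrade_v1_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Map a v1 request onto the v2 contract used by the single codebase."""
    # Strip fields that no longer exist in v2.
    upgraded = {k: v for k, v in request.items() if k not in V2_REMOVED}
    # Supply default values for fields that v1 consumers cannot send.
    for field, default in V2_DEFAULTS.items():
        upgraded.setdefault(field, default)
    return upgraded


def downgrade_v2_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Strip fields introduced in v2 before replying to a v1 consumer."""
    return {k: v for k, v in response.items() if k not in V2_DEFAULTS}


v2_request = upgrade_v1_request({"product_id": "B00", "legacy_code": "X"})
v1_response = downgrade_v2_response({"amount_cents": 1999, "currency": "USD"})
print(v2_request, v1_response)
```

With the translation isolated in the facade, the underlying service implements only the latest contract while existing consumers keep using the older version until they are ready to migrate.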

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL03-BP01 Choose how to segment your workload](rel_service_architecture_monolith_soa_microservice.md) 
+  [REL03-BP02 Build services focused on specific business domains and functionality](rel_service_architecture_business_domains.md) 
+  [REL04-BP02 Implement loosely coupled dependencies](rel_prevent_interaction_failure_loosely_coupled_system.md) 
+  [REL05-BP03 Control and limit retry calls](rel_mitigate_interaction_failure_limit_retries.md) 
+  [REL05-BP05 Set client timeouts](rel_mitigate_interaction_failure_client_timeouts.md) 

 **Related documents:** 
+  [What Is An API (Application Programming Interface)?](https://aws.amazon.com/what-is/api/) 
+  [Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/microservices-on-aws.html) 
+  [Microservice Trade-Offs](https://martinfowler.com/articles/microservice-trade-offs.html) 
+  [Microservices - a definition of this new architectural term](https://www.martinfowler.com/articles/microservices.html) 
+  [Microservices on AWS](https://aws.amazon.com/microservices/) 
+  [Working with API Gateway extensions to OpenAPI](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-swagger-extensions.html) 
+  [OpenAPI-Specification](https://github.com/OAI/OpenAPI-Specification) 
+  [GraphQL: Schemas and Types](https://graphql.org/learn/schema/) 
+  [Amazon EventBridge code bindings](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-schema-code-bindings.html) 

 **Related examples:** 
+  [Amazon API Gateway: Configuring a REST API Using OpenAPI](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-import-api.html) 
+  [Amazon API Gateway to Amazon DynamoDB CRUD application using OpenAPI](https://serverlessland.com/patterns/apigw-ddb-openapi-crud?ref=search) 
+  [Modern application integration patterns in a serverless age: API Gateway Service Integration](https://catalog.us-east-1.prod.workshops.aws/workshops/be7e1ee7-b91f-493d-93b0-8f7c5b002479/en-US/labs/asynchronous-request-response-poll/api-gateway-service-integration) 
+  [Implementing header-based API Gateway versioning with Amazon CloudFront](https://aws.amazon.com/blogs/compute/implementing-header-based-api-gateway-versioning-with-amazon-cloudfront/) 
+  [AWS AppSync: Building a client application](https://docs.aws.amazon.com/appsync/latest/devguide/building-a-client-app.html#aws-appsync-building-a-client-app) 

 **Related videos:** 
+  [Using OpenAPI in AWS SAM to manage API Gateway](https://www.youtube.com/watch?v=fet3bh0QA80) 

 **Related tools:** 
+  [Amazon API Gateway](https://aws.amazon.com/api-gateway/) 
+  [AWS AppSync](https://aws.amazon.com/appsync/) 
+  [Amazon EventBridge](https://aws.amazon.com/eventbridge/) 

# Design interactions in a distributed system to prevent failures
<a name="design-interactions-in-a-distributed-system-to-prevent-failures"></a>

 Distributed systems rely on communications networks to interconnect components, such as servers or services. Your workload must operate reliably despite data loss or latency in these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices prevent failures and improve mean time between failures (MTBF). 

**Topics**
+ [REL04-BP01 Identify the kind of distributed systems you depend on](rel_prevent_interaction_failure_identify.md)
+ [REL04-BP02 Implement loosely coupled dependencies](rel_prevent_interaction_failure_loosely_coupled_system.md)
+ [REL04-BP03 Do constant work](rel_prevent_interaction_failure_constant_work.md)
+ [REL04-BP04 Make mutating operations idempotent](rel_prevent_interaction_failure_idempotent.md)

# REL04-BP01 Identify the kind of distributed systems you depend on
<a name="rel_prevent_interaction_failure_identify"></a>

 Distributed systems can be synchronous, asynchronous, or batch. Synchronous systems must process requests as quickly as possible and communicate with each other by making synchronous request and response calls using HTTP/S, REST, or remote procedure call (RPC) protocols. Asynchronous systems communicate with each other by exchanging data asynchronously through an intermediary service without coupling individual systems. Batch systems receive a large volume of input data, run automated data processes without human intervention, and generate output data. 

 **Desired outcome**: Design a workload that effectively interacts with synchronous, asynchronous, and batch dependencies. 

 **Common anti-patterns**: 
+  Workload waits indefinitely for a response from its dependencies, which could lead to workload clients timing out, not knowing if their request has been received. 
+  Workload uses a chain of dependent systems that call each other synchronously. This requires each system to be available and to successfully process a request before the whole chain can succeed, leading to potentially brittle behavior and reduced overall availability. 
+  Workload communicates with its dependencies asynchronously and relies on the concept of exactly-once guaranteed delivery of messages, when often it is still possible to receive duplicate messages. 
+  Workload does not use proper batch scheduling tools and allows concurrent execution of the same batch job. 

 **Benefits of establishing this best practice**: It is common for a given workload to implement one or more styles of communication among synchronous, asynchronous, and batch. This best practice helps you identify the different trade-offs associated with each style of communication to make your workload able to tolerate disruptions in any of its dependencies. 

 **Level of risk exposed if this best practice is not established**: High 

## Implementation guidance
<a name="implementation-guidance"></a>

 The following sections contain both general and specific implementation guidance for each kind of dependency. 

 **General guidance** 
+  Make sure that the performance and reliability service-level objectives (SLOs) that your dependencies offer meet the performance and reliability requirements of your workload. 
+  Use [AWS observability services](https://aws.amazon.com/cloudops/monitoring-and-observability) to [monitor response times and error rates](https://www.youtube.com/watch?v=or7uFFyHIX0) to make sure your dependency is providing service at the levels needed by your workload. 
+  Identify the potential challenges that your workload may face when communicating with its dependencies. Distributed systems [come with a wide range of challenges](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/) that might increase architectural complexity, operational burden, and cost. Common challenges include latency, network disruptions, data loss, scaling, and data replication lag. 
+  Implement robust error handling and [logging](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) to help you troubleshoot problems when your dependency experiences issues. 

 **Synchronous dependency** 

 In synchronous communications, your workload sends a request to its dependency and blocks the operation waiting for a response. When its dependency receives the request, it tries to handle it as soon as possible and sends a response back to your workload. A significant challenge with synchronous communication is that it causes temporal coupling, which requires your workload and its dependencies to be available at the same time. When your workload needs to communicate synchronously with its dependencies, consider the following guidance: 
+  Your workload should not rely on multiple synchronous dependencies to perform a single function. This chain of dependencies increases overall brittleness because all dependencies in the pathway need to be available in order for the request to complete successfully. 
+  When a dependency is unhealthy or unavailable, determine your error handling and retry strategies. Avoid using bimodal behavior. Bimodal behavior is when your workload exhibits different behavior under normal and failure modes. For more details on bimodal behavior, see [REL11-BP05 Use static stability to prevent bimodal behavior](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_static_stability.html). 
+  Keep in mind that failing fast is better than making your workload wait. For instance, the [AWS Lambda Developer Guide](https://docs.aws.amazon.com/lambda/latest/dg/invocation-retries.html) describes how to handle retries and failures when you invoke Lambda functions. 
+  Set timeouts when your workload calls its dependency. This technique avoids waiting too long or waiting indefinitely for a response. For helpful discussion of this issue, see [Tuning AWS Java SDK HTTP request settings for latency-aware Amazon DynamoDB applications](https://aws.amazon.com/blogs/database/tuning-aws-java-sdk-http-request-settings-for-latency-aware-amazon-dynamodb-applications/). 
+  Minimize the number of calls made from your workload to its dependency to fulfill a single request. Having chatty calls between them increases coupling and latency. 
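
 The timeout and fail-fast guidance above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `call_with_timeout`, `call_with_retries`, and `DependencyTimeout` are hypothetical, and the dependency is assumed to be any Python callable. In practice, prefer the timeout and retry settings that your HTTP client or AWS SDK already provides. 

```python
import concurrent.futures
import time


class DependencyTimeout(Exception):
    """Raised when the dependency does not answer within the deadline."""


def call_with_timeout(dependency_call, timeout_seconds):
    """Run dependency_call in a worker thread and give up after timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(dependency_call)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError as exc:
        raise DependencyTimeout(f"no response within {timeout_seconds}s") from exc
    finally:
        # Do not block on an abandoned call; the worker thread ends on its own.
        executor.shutdown(wait=False)


def call_with_retries(dependency_call, timeout_seconds, max_attempts, backoff_seconds=0.1):
    """Try a bounded number of times, then fail fast instead of waiting indefinitely."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call_with_timeout(dependency_call, timeout_seconds)
        except DependencyTimeout:
            if attempt == max_attempts:
                raise  # Retry budget exhausted: surface the failure to the caller.
            time.sleep(backoff_seconds)
```

 Bounding both the wait per call and the number of attempts gives the caller a predictable worst-case latency, which it can use to return an error or a degraded response to its own clients. 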

 **Asynchronous dependency** 

 To temporally decouple your workload from its dependency, they should communicate asynchronously. Using an asynchronous approach, your workload can continue with any other processing without having to wait for its dependency, or chain of dependencies, to send a response. 

 When your workload needs to communicate asynchronously with its dependency, consider the following guidance: 
+  Determine whether to use messaging or event streaming based on your use case and requirements. [Messaging](https://aws.amazon.com/messaging/) allows your workload to communicate with its dependency by sending and receiving messages through a message broker. [Event streaming](https://aws.amazon.com/streaming-data/) allows your workload and its dependency to use a streaming service to publish and subscribe to events, delivered as continuous streams of data, that need to be processed as soon as possible. 
+  Messaging and event streaming handle messages differently so you need to make trade-off decisions based on: 
  +  **Message priority:** message brokers can process high-priority messages ahead of normal messages. In event streaming, all messages have the same priority. 
  +  **Message consumption**: message brokers ensure that consumers receive the message. Event streaming consumers must keep track of the last message they have read. 
  +  **Message ordering**: with messaging, receiving messages in the exact order they are sent is not guaranteed unless you use a first-in-first-out (FIFO) approach. Event streaming always preserves the order in which the data was produced. 
  +  **Message deletion**: with messaging, the consumer must delete the message after processing it. The event streaming service appends the message to a stream, where it remains until the message's retention period expires. This deletion policy makes event streaming suitable for replaying messages. 
+  Define how your workload knows when its dependency completes its work. For instance, when your workload invokes a [Lambda function asynchronously](https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html), Lambda places the event in a queue and returns a success response without additional information. After processing is complete, the Lambda function can [send the result to a destination](https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html#invocation-async-destinations), configurable based on success or failure. 
+  Build your workload to handle duplicate messages by leveraging idempotency. Idempotency means that the results of your workload do not change even if your workload processes the same message more than once. It is important to point out that [messaging](https://aws.amazon.com/sqs/faqs/#FIFO_queues) or [streaming](https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html) services will redeliver a message if a network failure occurs or if an acknowledgement has not been received. 
+  If your workload does not get a response from its dependency, it needs to resubmit the request. Consider limiting the number of retries to preserve your workload's CPU, memory, and network resources to handle other requests. The [AWS Lambda documentation](https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html#invocation-async-errors) shows how to handle errors for asynchronous invocation. 
+  Leverage suitable observability, debugging, and tracing tools to manage and operate your workload's asynchronous communication with its dependency. You can use [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) to monitor [messaging](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html) and [event streaming](https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html) services. You can also instrument your workload with [AWS X-Ray](https://aws.amazon.com/xray/) to quickly [gain insights](https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html) for troubleshooting problems. 
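
 The guidance on duplicate messages can be illustrated with a small consumer that remembers which message IDs it has already processed. This is a sketch under stated assumptions: `IdempotentConsumer` is a hypothetical class, each message is assumed to carry a stable unique ID, and the in-memory set stands in for a durable store that a real workload would need. 

```python
class IdempotentConsumer:
    """Applies a handler at most once per message ID, even if delivery repeats."""

    def __init__(self, handler):
        self._handler = handler
        self._processed_ids = set()  # Assumption: a durable store in a real workload.

    def receive(self, message_id, body):
        """Process the message once; return False for a duplicate delivery."""
        if message_id in self._processed_ids:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so that a failed
        # attempt is processed again when the message is redelivered.
        self._processed_ids.add(message_id)
        return True


# Usage: the same message delivered twice changes the total only once.
totals = []
consumer = IdempotentConsumer(totals.append)
consumer.receive("msg-1", 25)
consumer.receive("msg-1", 25)  # Duplicate delivery is ignored.
```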

 **Batch dependency** 

 Batch systems take input data, initiate a series of jobs to process it, and produce some output data, without manual intervention. Depending on the data size, jobs could run from minutes to, in some cases, several days. When your workload communicates with its batch dependency, consider the following guidance: 
+  Define the time window when your workload should run the batch job. Your workload can set up a recurrence pattern to invoke a batch system, for example, every hour or at the end of every month. 
+  Determine the location of the data input and the processed data output. Choose a storage service, such as [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/), [Amazon Elastic File System (Amazon EFS)](https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html), or [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html), that allows your workload to read and write files at scale. 
+  If your workload needs to invoke multiple batch jobs, you could leverage [AWS Step Functions](https://aws.amazon.com/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&step-functions.sort-order=desc) to simplify the orchestration of batch jobs that run in AWS or on-premises. This [sample project](https://github.com/aws-samples/aws-stepfunction-complex-orchestrator-app) demonstrates orchestration of batch jobs using Step Functions, [AWS Batch](https://aws.amazon.com/batch/), and Lambda. 
+  Monitor batch jobs to look for abnormalities, such as a job taking longer than it should to complete. In this instance, your workload would stop the next job from beginning and inform the relevant staff of the exception. You could use tools like [CloudWatch Container Insights](https://docs.aws.amazon.com/batch/latest/userguide/cloudwatch-container-insights.html) to monitor AWS Batch environments and jobs. 
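
 One of the anti-patterns listed earlier is allowing concurrent execution of the same batch job. The following sketch shows one way to guard against that on a single host by using an exclusive lock file. The names `single_run_lock` and `JobAlreadyRunning` are hypothetical, and the approach assumes all runs share one file system. A crashed process can leave a stale lock behind, so a managed scheduler or orchestrator is usually the better choice. 

```python
import os
from contextlib import contextmanager


class JobAlreadyRunning(Exception):
    """Raised when another run of the same batch job holds the lock."""


@contextmanager
def single_run_lock(lock_path):
    """Hold an exclusive lock file for the duration of one batch job run."""
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise JobAlreadyRunning(lock_path) from exc
    try:
        os.write(fd, str(os.getpid()).encode())  # Record the owner for troubleshooting.
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

 A job wraps its work in `with single_run_lock(path):` and treats `JobAlreadyRunning` as a signal to skip the run and notify the relevant staff. 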

## Resources
<a name="resources"></a>

 **Related documents**: 
+  [AWS Cloud Operations: Monitoring and Observability](https://aws.amazon.com/cloudops/monitoring-and-observability) 
+  [The Amazon Builders' Library: Challenges with distributed systems](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/) 
+  [REL11-BP05 Use static stability to prevent bimodal behavior](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_static_stability.html) 
+  [AWS Lambda Developer Guide: Error handling and automatic retries in AWS Lambda](https://docs.aws.amazon.com/lambda/latest/dg/invocation-retries.html) 
+  [Tuning AWS Java SDK HTTP request settings for latency-aware Amazon DynamoDB applications](https://aws.amazon.com/blogs/database/tuning-aws-java-sdk-http-request-settings-for-latency-aware-amazon-dynamodb-applications/) 
+  [AWS Messaging](https://aws.amazon.com/messaging/) 
+  [What is streaming data?](https://aws.amazon.com/streaming-data/) 
+  [AWS Lambda Developer Guide: Asynchronous invocation](https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html) 
+  [Amazon Simple Queue Service FAQ: FIFO queues](https://aws.amazon.com/sqs/faqs/#FIFO_queues) 
+  [Amazon Kinesis Data Streams Developer Guide: Handling Duplicate Records](https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html) 
+  [Amazon Simple Queue Service Developer Guide: Available CloudWatch metrics for Amazon SQS](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html) 
+  [Amazon Kinesis Data Streams Developer Guide: Monitoring the Amazon Kinesis Data Streams Service with Amazon CloudWatch](https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html) 
+  [AWS X-Ray Developer Guide: AWS X-Ray concepts](https://docs.aws.amazon.com/xray/latest/devguide/xray-concepts.html) 
+  [AWS Samples on GitHub: AWS Step Functions Complex Orchestrator App](https://github.com/aws-samples/aws-stepfunction-complex-orchestrator-app) 
+  [AWS Batch User Guide: AWS Batch CloudWatch Container Insights](https://docs.aws.amazon.com/batch/latest/userguide/cloudwatch-container-insights.html) 

 **Related videos**: 
+  [AWS Summit SF 2022 - Full-stack observability and application monitoring with AWS (COP310)](https://www.youtube.com/watch?v=or7uFFyHIX0) 

 **Related tools**: 
+  [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) 
+  [Amazon CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) 
+  [AWS X-Ray](https://aws.amazon.com/xray/) 
+  [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/) 
+  [Amazon Elastic File System (Amazon EFS)](https://docs.aws.amazon.com/efs/latest/ug/whatisefs.html) 
+  [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) 
+  [AWS Step Functions](https://aws.amazon.com/step-functions/?step-functions.sort-by=item.additionalFields.postDateTime&step-functions.sort-order=desc) 
+  [AWS Batch](https://aws.amazon.com/batch/) 

# REL04-BP02 Implement loosely coupled dependencies
<a name="rel_prevent_interaction_failure_loosely_coupled_system"></a>

 Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility. 

 Decoupling dependencies, such as queuing systems, streaming systems, and workflows, help minimize the impact of changes or failure on a system. This separation isolates a component's behavior from affecting others that depend on it, improving resilience and agility. 

 In tightly coupled systems, changes to one component can necessitate changes in other components that rely on it, resulting in degraded performance across all components. *Loose* coupling breaks this dependency so that dependent components only need to know the versioned and published interface. Implementing loose coupling between dependencies isolates a failure in one from impacting another. 

 Loose coupling allows you to modify code or add features to a component while minimizing risk to other components that depend on it. It also allows for granular resilience at a component level where you can scale out or even change underlying implementation of the dependency. 

 To further improve resiliency through loose coupling, make component interactions asynchronous where possible. This model is suitable for any interaction that does not need an immediate response and where an acknowledgment that a request has been registered will suffice. It involves one component that generates events and another that consumes them. The two components do not integrate through direct point-to-point interaction but usually through an intermediate durable storage layer, such as an Amazon SQS queue, a streaming data platform such as Amazon Kinesis, or AWS Step Functions. 

![\[Diagram showing dependencies such as queuing systems and load balancers are loosely coupled\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/dependency-diagram.png)


 Amazon SQS queues and AWS Step Functions are just two ways to add an intermediate layer for loose coupling. Event-driven architectures can also be built in the AWS Cloud using Amazon EventBridge, which can abstract clients (event producers) from the services they rely on (event consumers). Amazon Simple Notification Service (Amazon SNS) is an effective solution when you need high-throughput, push-based, many-to-many messaging. Using Amazon SNS topics, your publisher systems can fan out messages to a large number of subscriber endpoints for parallel processing. 

 While queues offer several advantages, in most hard real-time systems, requests older than a threshold time (often seconds) should be considered stale (the client has given up and is no longer waiting for a response), and not processed. This way, more recent (and likely still valid) requests can be processed instead. 
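
 As a minimal sketch of the stale-request rule, the function below splits queued requests by age. The function name `take_fresh_requests` and the `enqueued_at` field (seconds since the epoch, recorded when the request was queued) are assumptions made for illustration. 

```python
import time


def take_fresh_requests(queued_requests, max_age_seconds, now=None):
    """Split queued requests into (fresh, stale) based on their age."""
    now = time.time() if now is None else now
    fresh, stale = [], []
    for request in queued_requests:
        age = now - request["enqueued_at"]
        if age <= max_age_seconds:
            fresh.append(request)
        else:
            stale.append(request)  # The client has likely given up; skip the work.
    return fresh, stale
```

 The consumer processes only the fresh requests and discards (or just records) the stale ones, so capacity goes to clients that are still waiting. 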

 **Desired outcome:** Implementing loosely coupled dependencies allows you to minimize the surface area for failure to a component level, which helps diagnose and resolve issues. It also simplifies development cycles, allowing teams to implement changes at a modular level without affecting the performance of other components that depend on it. This approach also lets you scale out at a component level based on each component's resource needs and utilization, which contributes to cost-effectiveness. 

 **Common anti-patterns:** 
+  Deploying a monolithic workload. 
+  Directly invoking APIs between workload tiers with no capability of failover or asynchronous processing of the request. 
+  Tight coupling using shared data. Loosely coupled systems should avoid sharing data through shared databases or other forms of tightly coupled data storage, which can reintroduce tight coupling and hinder scalability. 
+  Ignoring back pressure. Your workload should have the ability to slow down or stop incoming data when a component can't process it at the same rate. 
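
 To make the back pressure anti-pattern concrete, the sketch below uses a bounded in-memory queue that rejects new items when it is full, which signals the producer to slow down or retry later. `BoundedInbox` is a hypothetical name, and the standard library `queue.Queue` stands in for whatever buffer sits between your components. 

```python
import queue


class BoundedInbox:
    """A buffer with a fixed capacity that pushes back when it is full."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def offer(self, item):
        """Accept the item if there is room; return False to signal back pressure."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest item; raises queue.Empty if none is waiting."""
        return self._queue.get_nowait()
```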

 **Benefits of establishing this best practice:** Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility. Failure in one component is isolated from others. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implement loosely coupled dependencies. There are various solutions that allow you to build loosely coupled applications. These include services for implementing fully managed queues, automating workflows, reacting to events, and exposing APIs, among others. These services help isolate the behavior of components from other components, which increases resilience and agility. 
+  **Build event-driven architectures:** [Amazon EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html) helps you build loosely coupled and distributed event-driven architectures. 
+  **Implement queues in distributed systems:** You can use [Amazon Simple Queue Service (Amazon SQS)](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) to integrate and decouple distributed systems. 
+  **Containerize components as microservices:** [Microservices](https://aws.amazon.com/microservices/) allow teams to build applications composed of small independent components which communicate over well-defined APIs. [Amazon Elastic Container Service (Amazon ECS)](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/Welcome.html) and [Amazon Elastic Kubernetes Service (Amazon EKS)](https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html) can help you get started faster with containers. 
+  **Manage workflows with Step Functions:** [Step Functions](https://aws.amazon.com/step-functions/getting-started/) helps you coordinate multiple AWS services into flexible workflows. 
+  **Leverage publish-subscribe (pub/sub) messaging architectures:** [Amazon Simple Notification Service (Amazon SNS)](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) provides message delivery from publishers to subscribers (also known as producers and consumers). 

### Implementation steps
<a name="implementation-steps"></a>
+  Components in an event-driven architecture are initiated by events. Events are actions that happen in a system, such as a user adding an item to a cart. When an action is successful, an event is generated that actuates the next component of the system. 
  + [ Building Event-driven Applications with Amazon EventBridge ](https://aws.amazon.com/blogs/compute/building-an-event-driven-application-with-amazon-eventbridge/)
  + [AWS re:Invent 2022 - Designing Event-Driven Integrations using Amazon EventBridge ](https://www.youtube.com/watch?v=W3Rh70jG-LM)
+  Distributed messaging systems have three main parts that need to be implemented for a queue-based architecture: the components of the distributed system, the queue that is used for decoupling (distributed on Amazon SQS servers), and the messages in the queue. A typical system has producers, which send messages to the queue, and consumers, which receive messages from the queue. The queue stores messages across multiple Amazon SQS servers for redundancy. 
  + [ Basic Amazon SQS architecture ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-basic-architecture.html)
  + [ Send Messages Between Distributed Applications with Amazon Simple Queue Service ](https://aws.amazon.com/getting-started/hands-on/send-messages-distributed-applications/)
+  Microservices, when well-utilized, enhance maintainability and boost scalability, as loosely coupled components are managed by independent teams. They also isolate the effects of a change to a single component. 
  + [ Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/microservices-on-aws.html)
  + [ Let's Architect! Architecting microservices with containers ](https://aws.amazon.com/blogs/architecture/lets-architect-architecting-microservices-with-containers/)
+  With AWS Step Functions, you can build distributed applications, automate processes, and orchestrate microservices, among other things. The orchestration of multiple components into an automated workflow allows you to decouple dependencies in your application. 
  + [ Create a Serverless Workflow with AWS Step Functions and AWS Lambda](https://aws.amazon.com/tutorials/create-a-serverless-workflow-step-functions-lambda/)
  + [ Getting Started with AWS Step Functions](https://aws.amazon.com/step-functions/getting-started/)
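
 The event-driven step above can be illustrated with a small in-process event bus. This is only a sketch of the decoupling idea: `EventBus` is a hypothetical class that delivers events synchronously inside one process, whereas managed services such as Amazon EventBridge or Amazon SNS deliver them asynchronously and durably. 

```python
class EventBus:
    """Routes events from producers to subscribers without either knowing the other."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, event_type, handler):
        """Register a consumer for one type of event."""
        self._subscribers.setdefault(event_type, []).append(handler)

    def publish(self, event_type, detail):
        """Deliver the event to every subscriber; return how many handled it."""
        delivered = 0
        for handler in self._subscribers.get(event_type, []):
            try:
                handler(detail)
                delivered += 1
            except Exception:
                # A failing consumer does not affect the producer or other consumers.
                continue
        return delivered
```

 The producer depends only on the event name and payload, so consumers can be added, removed, or replaced without changing the producer. 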

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon EC2: Ensuring Idempotency](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html) 
+  [The Amazon Builders' Library: Challenges with distributed systems](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/) 
+  [The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee](https://aws.amazon.com/builders-library/reliability-and-constant-work/) 
+  [What Is Amazon EventBridge?](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html) 
+  [What Is Amazon Simple Queue Service?](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/welcome.html) 
+ [ Break up with your monolith ](https://pages.awscloud.com/break-up-your-monolith.html)
+ [ Orchestrate Queue-based Microservices with AWS Step Functions and Amazon SQS ](https://aws.amazon.com/tutorials/orchestrate-microservices-with-message-queues-on-step-functions/)
+ [ Basic Amazon SQS architecture ](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-basic-architecture.html)
+ [ Queue-Based Architecture ](https://docs.aws.amazon.com/wellarchitected/latest/high-performance-computing-lens/queue-based-architecture.html)

 **Related videos:** 
+  [AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)](https://youtu.be/tvELVa9D9qU) 
+  [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes loose coupling, constant work, static stability)](https://youtu.be/O8xLxNje30M) 
+  [AWS re:Invent 2019: Moving to event-driven architectures (SVS308)](https://youtu.be/h46IquqjF3E) 
+ [AWS re:Invent 2019: Scalable serverless event-driven applications using Amazon SQS and Lambda ](https://www.youtube.com/watch?v=2rikdPIFc_Q)
+ [AWS re:Invent 2022 - Designing event-driven integrations using Amazon EventBridge ](https://www.youtube.com/watch?v=W3Rh70jG-LM)
+ [AWS re:Invent 2017: Elastic Load Balancing Deep Dive and Best Practices ](https://www.youtube.com/watch?v=9TwkMMogojY)

# REL04-BP03 Do constant work
<a name="rel_prevent_interaction_failure_constant_work"></a>

 Systems can fail when there are large, rapid changes in load. For example, if your workload is doing a health check that monitors the health of thousands of servers, it should send the same size payload (a full snapshot of the current state) each time. Whether no servers are failing, or all of them, the health check system is doing constant work with no large, rapid changes. 

 For example, if the health check system is monitoring 100,000 servers, the load on it is nominal under the normally light server failure rate. However, if a major event makes half of those servers unhealthy, then the health check system would be overwhelmed trying to update notification systems and communicate state to its clients. So instead the health check system should send the full snapshot of the current state each time. 100,000 server health states, each represented by a bit, would only be a 12.5-KB payload. Whether no servers are failing, or all of them are, the health check system is doing constant work, and large, rapid changes are not a threat to the system stability. This is actually how Amazon Route 53 handles health checks for endpoints (such as IP addresses) to determine how end users are routed to them. 
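
 The payload arithmetic above can be checked with a short sketch that packs one bit per server into a fixed-size snapshot. The function names and the choice that a set bit means unhealthy are assumptions for illustration. This is not how Amazon Route 53 is implemented. 

```python
def build_health_snapshot(server_count, unhealthy_server_ids):
    """Return a full snapshot with one bit per server (1 means unhealthy)."""
    snapshot = bytearray((server_count + 7) // 8)
    for server_id in unhealthy_server_ids:
        snapshot[server_id // 8] |= 1 << (server_id % 8)
    return bytes(snapshot)


def is_unhealthy(snapshot, server_id):
    """Read the health bit for a single server from the snapshot."""
    return bool(snapshot[server_id // 8] & (1 << (server_id % 8)))


# The payload is the same size whether no servers or half of them are failing.
all_healthy = build_health_snapshot(100_000, [])
half_failed = build_health_snapshot(100_000, range(50_000))
```

 Both snapshots are 12,500 bytes (100,000 bits divided by 8), so the sender and receivers do the same amount of work in a quiet period as in a major event. 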

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Do constant work so that systems do not fail when there are large, rapid changes in load. 
+  Implement loosely coupled dependencies. Dependencies such as queuing systems, streaming systems, workflows, and load balancers are loosely coupled. Loose coupling helps isolate behavior of a component from other components that depend on it, increasing resiliency and agility. 
  +  [The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee](https://aws.amazon.com/builders-library/reliability-and-constant-work/) 
  +  [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes constant work)](https://youtu.be/O8xLxNje30M?t=2482) 
    +  For the example of a health check system monitoring 100,000 servers, engineer workloads so that payload sizes remain constant regardless of number of successes or failures. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon EC2: Ensuring Idempotency](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/Run_Instance_Idempotency.html) 
+  [The Amazon Builders' Library: Challenges with distributed systems](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/) 
+  [The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee](https://aws.amazon.com/builders-library/reliability-and-constant-work/) 

 **Related videos:** 
+  [AWS New York Summit 2019: Intro to Event-driven Architectures and Amazon EventBridge (MAD205)](https://youtu.be/tvELVa9D9qU) 
+  [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes constant work)](https://youtu.be/O8xLxNje30M?t=2482) 
+  [AWS re:Invent 2018: Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes loose coupling, constant work, static stability)](https://youtu.be/O8xLxNje30M) 
+  [AWS re:Invent 2019: Moving to event-driven architectures (SVS308)](https://youtu.be/h46IquqjF3E) 

# REL04-BP04 Make mutating operations idempotent
<a name="rel_prevent_interaction_failure_idempotent"></a>

 An idempotent service promises that each request is processed exactly once, such that making multiple identical requests has the same effect as making a single request. This makes it easier for a client to implement retries without fear that a request is erroneously processed multiple times. To do this, clients can issue API requests with an idempotency token, which is used whenever the request is repeated. An idempotent service API uses the token to return a response identical to the response that was returned the first time that the request was completed, even if the underlying state of the system has changed. 

 In a distributed system, it is relatively simple to perform an action at most once (client makes only one request) or at least once (keep requesting until client gets confirmation of success). It is more difficult to guarantee an action is performed *exactly once*, such that making multiple identical requests has the same effect as making a single request. Using idempotency tokens in APIs, services can receive a mutating request one or more times without the need to create duplicate records or side effects. 

 **Desired outcome:** You have a consistent, well-documented, and widely adopted approach for ensuring idempotency across all components and services. 

 **Common anti-patterns:** 
+  You apply idempotency indiscriminately, even when not needed. 
+  You introduce overly complex logic for implementing idempotency. 
+  You use timestamps as keys for idempotency. This can cause inaccuracies due to clock skew or due to multiple clients that use the same timestamps to apply changes. 
+  You store entire payloads for idempotency. In this approach, you save the complete data payload for every request and overwrite it at each new request. This can degrade performance and affect scalability. 
+  You generate keys inconsistently across services. Without consistent keys, services may fail to recognize duplicate requests, which results in unintended results. 

 **Benefits of establishing this best practice:** 
+  Greater scalability: The system can handle retries and duplicate requests without having to perform additional logic or complex state management. 
+  Enhanced reliability: Idempotency helps services handle multiple identical requests in a consistent manner, which reduces the risk of unintended side effects or duplicate records. This is especially crucial in distributed systems, where network failures and retries are common. 
+  Improved data consistency: Because the same request produces the same response, idempotency helps maintain data consistency across distributed systems. This is essential to maintain the integrity of transactions and operations. 
+  Error handling: Idempotency tokens make error handling more straightforward. If a client does not receive a response due to an issue, it can safely resend the request with the same idempotency token. 
+  Operational transparency: Idempotency allows for better monitoring and logging. Services can log requests with their idempotency tokens, which makes it easier to trace and debug issues. 
+  Simplified API contract: It can simplify the contract between the client and server side systems and reduce the fear of erroneous data processing. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 In a distributed system, performing an action at most once (the client makes only one request) or at least once (the client keeps requesting until success is confirmed) is relatively straightforward. However, it's challenging to implement *exactly once* behavior. To achieve this, your clients should generate and provide an idempotency token for each request. 

 By using idempotency tokens, a service can distinguish between new requests and repeated ones. When a service receives a request with an idempotency token, it checks if the token has already been used. If the token has been used, the service retrieves and returns the stored response. If the token is new, the service processes the request, stores the response along with the token, and then returns the response. This mechanism makes all responses idempotent, which enhances the reliability and consistency of the distributed system. 
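
 The token check described above can be sketched as follows. `IdempotentPaymentService` and its `create_payment` method are hypothetical names, the dictionary stands in for a persistent store, and the lock is a simple way to make the check-and-store step atomic within one process. 

```python
import threading


class IdempotentPaymentService:
    """Returns the stored response when a request repeats an idempotency token."""

    def __init__(self):
        self._responses = {}  # Assumption: a persistent store in a real service.
        self._lock = threading.Lock()
        self._next_payment_id = 1

    def create_payment(self, idempotency_token, amount):
        with self._lock:
            # Repeated token: return the original response without side effects.
            if idempotency_token in self._responses:
                return self._responses[idempotency_token]
            # New token: perform the mutation, then store the response with the token.
            response = {"payment_id": self._next_payment_id, "amount": amount}
            self._next_payment_id += 1
            self._responses[idempotency_token] = response
            return response
```

 A client that times out can resend the request with the same token and receive the original response, and only one payment is created. 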

 Idempotency is also an important behavior of event-driven architectures. These architectures are typically backed by a message queue or stream such as Amazon SQS, Amazon MQ, Amazon Kinesis Data Streams, or Amazon Managed Streaming for Apache Kafka (Amazon MSK). In some circumstances, a message that was published only once may be accidentally delivered more than once. When a publisher generates and includes idempotency tokens in messages, it requests that the processing of any duplicate message received doesn't result in a repeated action for the same message. Consumers should keep track of each token received and ignore messages that contain duplicate tokens. 

 Services and consumers should also pass the received idempotency token to any downstream services that they call. Every downstream service in the processing chain is similarly responsible for making sure that idempotency is implemented to avoid the side effect of processing a message more than once. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Identify idempotent operations** 

    Determine which operations require idempotency. These typically include POST, PUT, and DELETE HTTP methods and database insert, update, or delete operations. Operations that do not mutate state, such as read-only queries, usually do not require idempotency unless they have side effects. 

1.  **Use unique identifiers** 

    Include a unique token in each idempotent operation request sent by the sender, either directly in the request or as part of its metadata (for example, an HTTP header). This allows the receiver to recognize and handle duplicate requests or operations. Identifiers commonly used for tokens include [Universally Unique Identifiers (UUIDs)](https://datatracker.ietf.org/doc/html/rfc9562) and [K-Sortable Unique Identifiers (KSUIDs)](https://github.com/segmentio/ksuid). 

1.  **Track and manage state** 

    Maintain the state of each operation or request in your workload. This can be achieved by storing the idempotency token and the corresponding state (such as pending, completed, or failed) in a database, cache, or other persistent store. This state information allows the workload to identify and handle duplicate requests or operations. 

    Maintain consistency and atomicity by using appropriate concurrency control mechanisms if needed, such as locks, transactions, or optimistic concurrency controls. This includes the process of recording the idempotency token and running all mutating operations associated with servicing the request. This helps prevent race conditions and verifies that idempotent operations run correctly. 

    Regularly remove old idempotency tokens from the datastore to manage storage and performance. If your storage system supports it, consider using expiration timestamps for data (often known as time to live, or TTL values). The likelihood of idempotency token reuse diminishes over time. 

    Common AWS storage options typically used for storing idempotency tokens and related state include: 
   +  **Amazon DynamoDB**: DynamoDB is a NoSQL database service that provides low-latency performance and high availability, which makes it well-suited for the storage of idempotency-related data. The key-value and document data model of DynamoDB allows for efficient storage and retrieval of idempotency tokens and associated state information. DynamoDB can also expire idempotency tokens automatically if your application sets a TTL value when it inserts them. 
   +  **Amazon ElastiCache**: ElastiCache can store idempotency tokens with high throughput, low latency, and at low cost. Both ElastiCache (Redis) and ElastiCache (Memcached) can also expire idempotency tokens automatically if your application sets a TTL value when it inserts them. 
   +  **Amazon Relational Database Service (RDS):** You can use Amazon RDS to store idempotency tokens and related state information, especially if your application already uses a relational database for other purposes. 
   +  **Amazon Simple Storage Service (S3):** Amazon S3 is a highly scalable and durable object storage service that can be used to store idempotency tokens and related metadata. The versioning capabilities of S3 can be particularly useful for maintenance of the state of idempotent operations. 

    The choice of storage service typically depends on factors such as the volume of idempotency-related data, the required performance characteristics, the need for durability and availability, and how the idempotency mechanism integrates with the overall workload architecture. 

1.  **Implement idempotent operations** 

    Design your API and workload components to be idempotent. Incorporate idempotency checks into your workload components. Before you process a request or perform an operation, check if the unique identifier has already been processed. If it has, return the previous result instead of executing the operation again. For example, if a client sends a request to create a user, check if a user with the same unique identifier already exists. If the user exists, the service should return the existing user information instead of creating a new one. Similarly, if a queue consumer receives a message with a duplicate idempotency token, the consumer should ignore the message. 

    Create comprehensive test suites that validate the idempotency of requests. They should cover a wide range of scenarios, such as successful requests, failed requests, and duplicate requests. 

    If your workload leverages AWS Lambda functions, consider Powertools for AWS Lambda. Powertools for AWS Lambda is a developer toolkit that helps implement serverless best practices and increase developer velocity when you work with AWS Lambda functions. In particular, it provides a utility to convert your Lambda functions into idempotent operations which are safe to retry. 

1.  **Communicate idempotency clearly** 

    Document your API and workload components to clearly communicate the idempotent nature of the operations. This helps clients understand the expected behavior and how to interact with your workload reliably. 

1.  **Monitor and audit** 

    Implement monitoring and auditing mechanisms to detect any issues related to the idempotency of responses, such as unexpected response variations or excessive duplicate request handling. This can help you detect and investigate any issues or unexpected behaviors in your workload. 
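
 The state tracking, concurrency control, and expiration steps above can be combined in a small sketch. The `IdempotencyStore` class, its `claim` and `complete` methods, and the `create_user` function are hypothetical names, and the in-process lock stands in for whatever atomic operation your datastore provides.

```python
import threading
import time


class IdempotencyStore:
    """In-memory token store with a status per token and TTL-based expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._records = {}  # token -> {"status", "result", "expires_at"}

    def _purge_expired(self) -> None:
        now = self._clock()
        for token in [t for t, r in self._records.items() if r["expires_at"] <= now]:
            del self._records[token]

    def claim(self, token: str):
        """Atomically mark a token as pending.

        Returns None if the caller now owns the token, otherwise a copy of
        the existing record.
        """
        with self._lock:
            self._purge_expired()
            existing = self._records.get(token)
            if existing is not None:
                return dict(existing)
            self._records[token] = {
                "status": "pending",
                "result": None,
                "expires_at": self._clock() + self._ttl,
            }
            return None

    def complete(self, token: str, result) -> None:
        with self._lock:
            self._records[token] = {
                "status": "completed",
                "result": result,
                "expires_at": self._clock() + self._ttl,
            }


def create_user(store: IdempotencyStore, users: list, token: str, name: str):
    existing = store.claim(token)
    if existing is not None:
        # Duplicate request: return the earlier result (None while still pending).
        return existing["result"]
    users.append(name)  # the mutating operation
    result = {"user": name}
    store.complete(token, result)
    return result
```

 With a shared datastore, the `claim` step is usually implemented with an atomic primitive of that store (for example, a conditional write) instead of an in-process lock.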

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL05-BP03 Control and limit retry calls](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_mitigate_interaction_failure_limit_retries.html) 
+  [REL06-BP01 Monitor all components for the workload (Generation)](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_monitor_aws_resources_monitor_resources.html) 
+  [REL06-BP03 Send notifications (Real-time processing and alarming)](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_monitor_aws_resources_notification_monitor.html) 
+  [REL08-BP02 Integrate functional testing as part of your deployment](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_tracking_change_management_functional_testing.html) 

 **Related documents:** 
+  [The Amazon Builders' Library: Making retries safe with idempotent APIs](https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/) 
+  [The Amazon Builders' Library: Challenges with distributed systems](https://aws.amazon.com/builders-library/challenges-with-distributed-systems/) 
+  [The Amazon Builders' Library: Reliability, constant work, and a good cup of coffee](https://aws.amazon.com/builders-library/reliability-and-constant-work/) 
+  [Amazon Elastic Container Service: Ensuring idempotency](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/ECS_Idempotency.html) 
+  [How do I make my Lambda function idempotent?](https://repost.aws/knowledge-center/lambda-function-idempotent) 
+  [Ensuring idempotency in Amazon EC2 API requests](https://docs.aws.amazon.com/ec2/latest/devguide/ec2-api-idempotency.html) 

 **Related videos:** 
+  [Building Distributed Applications with Event-driven Architecture - AWS Online Tech Talks](https://www.youtube.com/watch?v=gA2-eqDVSng&t=1668s) 
+  [AWS re:Invent 2023 - Building next-generation applications with event-driven architecture](https://www.youtube.com/watch?v=KXR17uwLEC8) 
+  [AWS re:Invent 2023 - Advanced integration patterns & trade-offs for loosely coupled systems](https://www.youtube.com/watch?v=FGKGdUiZKto) 
+  [AWS re:Invent 2023 - Advanced event-driven patterns with Amazon EventBridge](https://www.youtube.com/watch?v=6X4lSPkn4ps) 
+  [AWS re:Invent 2018 - Close Loops and Opening Minds: How to Take Control of Systems, Big and Small ARC337 (includes loose coupling, constant work, static stability)](https://youtu.be/O8xLxNje30M) 
+  [AWS re:Invent 2019 - Moving to event-driven architectures (SVS308)](https://youtu.be/h46IquqjF3E) 

 **Related tools:** 
+  [Idempotency with AWS Lambda Powertools (Java)](https://docs.powertools.aws.dev/lambda/java/utilities/idempotency/) 
+  [Idempotency with AWS Lambda Powertools (Python)](https://docs.powertools.aws.dev/lambda/python/latest/utilities/idempotency/) 
+  [AWS Lambda Powertools GitHub page](https://github.com/aws-powertools/) 

# Design interactions in a distributed system to mitigate or withstand failures
<a name="design-interactions-in-a-distributed-system-to-mitigate-or-withstand-failures"></a>

 Distributed systems rely on communications networks to interconnect components (such as servers or services). Your workload must operate reliably despite data loss or latency over these networks. Components of the distributed system must operate in a way that does not negatively impact other components or the workload. These best practices allow workloads to withstand stresses or failures, more quickly recover from them, and mitigate the impact of such impairments. The result is improved mean time to recovery (MTTR). 

 These best practices prevent failures and improve mean time between failures (MTBF). 

**Topics**
+ [

# REL05-BP01 Implement graceful degradation to transform applicable hard dependencies into soft dependencies
](rel_mitigate_interaction_failure_graceful_degradation.md)
+ [

# REL05-BP02 Throttle requests
](rel_mitigate_interaction_failure_throttle_requests.md)
+ [

# REL05-BP03 Control and limit retry calls
](rel_mitigate_interaction_failure_limit_retries.md)
+ [

# REL05-BP04 Fail fast and limit queues
](rel_mitigate_interaction_failure_fail_fast.md)
+ [

# REL05-BP05 Set client timeouts
](rel_mitigate_interaction_failure_client_timeouts.md)
+ [

# REL05-BP06 Make systems stateless where possible
](rel_mitigate_interaction_failure_stateless.md)
+ [

# REL05-BP07 Implement emergency levers
](rel_mitigate_interaction_failure_emergency_levers.md)

# REL05-BP01 Implement graceful degradation to transform applicable hard dependencies into soft dependencies
<a name="rel_mitigate_interaction_failure_graceful_degradation"></a>

Application components should continue to perform their core function even if dependencies become unavailable. They might be serving slightly stale data, alternate data, or even no data. This ensures overall system function is only minimally impeded by localized failures while delivering the central business value.

 **Desired outcome:** When a component's dependencies are unhealthy, the component itself can still function, although in a degraded manner. Failure modes of components should be seen as normal operation. Workflows should be designed in such a way that such failures do not lead to complete failure, or at least lead to predictable and recoverable states. 

 **Common anti-patterns:** 
+  Not identifying the core business functionality needed. Not testing that components are functional even during dependency failures. 
+  Serving no data on errors or when only one out of multiple dependencies is unavailable and partial results can still be returned. 
+  Creating an inconsistent state when a transaction partially fails. 
+  Not having an alternative way to access a central parameter store. 
+  Invalidating or emptying local state as a result of a failed refresh without considering the consequences of doing so. 

 **Benefits of establishing this best practice:** Graceful degradation improves the availability of the system as a whole and maintains the functionality of the most important functions even during failures. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implementing graceful degradation helps minimize the impact of dependency failures on component function. Ideally, a component detects dependency failures and works around them in a way that minimally impacts other components or customers. 

 Architecting for graceful degradation means considering potential failure modes during dependency design. For each failure mode, have a way to deliver most or at least the most critical functionality of the component to callers or customers. These considerations can become additional requirements that can be tested and verified. Ideally, a component is able to perform its core function in an acceptable manner even when one or multiple dependencies fail. 

 This is as much a business discussion as a technical one. All business requirements are important and should be fulfilled if possible. However, it still makes sense to ask what should happen when not all of them can be fulfilled. A system can be designed to be available and consistent, but under circumstances where one requirement must be dropped, which one is more important? For payment processing, it might be consistency. For a real-time application, it might be availability. For a customer facing website, the answer may depend on customer expectations. 

 What this means depends on the requirements of the component and what should be considered its core function. For example: 
+  An ecommerce website might display data from multiple different systems like personalized recommendations, highest ranked products, and status of customer orders on the landing page. When one upstream system fails, it still makes sense to display everything else instead of showing an error page to a customer. 
+  A component performing batch writes can still continue processing a batch if one of the individual operations fails. It should be simple to implement a retry mechanism. This can be done by returning information on which operations succeeded, which failed, and why they failed to the caller, or putting failed requests into a dead letter queue to implement asynchronous retries. Information about failed operations should be logged as well. 
+  A system that processes transactions must verify that either all or no individual updates are executed. For distributed transactions, the saga pattern can be used to roll back previous operations in case a later operation of the same transaction fails. Here, the core function is maintaining consistency. 
+  Time critical systems should be able to deal with dependencies not responding in a timely manner. In these cases, the circuit breaker pattern can be used. When responses from a dependency start timing out, the system can switch to an open state where no additional calls are made. 
+  An application may read parameters from a parameter store. It can be useful to create container images with a default set of parameters and use these in case the parameter store is unavailable. 
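
 The ecommerce example in the list above can be sketched as follows: each page section is fetched independently, and a failed section is left out instead of failing the whole page. The function and section names are hypothetical.

```python
def build_landing_page(sections: dict):
    """Fetch each section independently and return (page, failed_sections).

    `sections` maps a section name to a zero-argument callable that fetches it.
    """
    page, failed = {}, []
    for name, fetch in sections.items():
        try:
            page[name] = fetch()
        except Exception:
            # Leave the section out. A real workload would also log the
            # failure and emit a metric here.
            failed.append(name)
    return page, failed


def recommendations():
    raise TimeoutError("recommendation service unavailable")


page, failed = build_landing_page({
    "top_products": lambda: ["kettle", "lamp"],
    "recommendations": recommendations,
    "order_status": lambda: {"order-17": "shipped"},
})
```

 Here the page is still built from the two healthy sections, and the caller learns which section was omitted.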

 Note that the pathways taken in case of component failure need to be tested and should be significantly simpler than the primary pathway. Generally, [fallback strategies should be avoided](https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/). 

## Implementation steps
<a name="implementation-steps"></a>

 Identify external and internal dependencies. Consider what kinds of failures can occur in them. Think about ways that minimize negative impact on upstream and downstream systems and customers during those failures. 

 The following is a list of dependencies and how to degrade gracefully when they fail: 

1.  **Partial failure of dependencies:** A component may make multiple requests to downstream systems, either as multiple requests to one system or one request to multiple systems each. Depending on the business context, different ways of handling this may be appropriate (for more detail, see previous examples in Implementation guidance). 

1.  **A downstream system is unable to process requests due to high load:** If requests to a downstream system are consistently failing, it does not make sense to continue retrying. This may create additional load on an already overloaded system and make recovery more difficult. The circuit breaker pattern can be utilized here, which monitors failing calls to a downstream system. If a high number of calls are failing, it will stop sending more requests to the downstream system and only occasionally let calls through to test whether the downstream system is available again. 

1.  **A parameter store is unavailable:** To transform a parameter store into a soft dependency, caching or sane defaults included in container or machine images may be used. Note that these defaults need to be kept up-to-date and included in test suites. 

1.  **A monitoring service or other non-functional dependency is unavailable:** If a component is intermittently unable to send logs, metrics, or traces to a central monitoring service, it is often best to still execute business functions as usual. However, silently not logging or pushing metrics for a long time is often not acceptable. Also, some use cases may require complete auditing entries to fulfill compliance requirements. 

1.  **A primary instance of a relational database may be unavailable:** Amazon Relational Database Service, like almost all relational databases, can only have one primary writer instance. This creates a single point of failure for write workloads and makes scaling more difficult. This can partially be mitigated by using a Multi-AZ configuration for high availability or Amazon Aurora Serverless for better scaling. For very high availability requirements, it can make sense to not rely on the primary writer at all. For queries that only read, read replicas can be used, which provide redundancy and the ability to scale out, not just up. Writes can be buffered, for example in an Amazon Simple Queue Service queue, so that write requests from customers can still be accepted even if the primary is temporarily unavailable. 
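
 The circuit breaker behavior described in the list above can be sketched as follows. This is a minimal, single-threaded illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and assumed default thresholds. Production workloads typically use an established library instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls to the dependency are currently suspended."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail immediately without calling the dependency.
                raise CircuitOpenError("dependency calls are suspended")
            # Timeout elapsed: let a trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

 While the circuit is open, callers fail immediately and the overloaded dependency receives no additional load. After the reset timeout, a successful trial call closes the circuit again.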

## Resources
<a name="resources"></a>

 **Related documents:** 
+  [Amazon API Gateway: Throttle API Requests for Better Throughput](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html) 
+  [CircuitBreaker (summarizes Circuit Breaker from “Release It!” book)](https://martinfowler.com/bliki/CircuitBreaker.html) 
+  [Error Retries and Exponential Backoff in AWS](https://docs.aws.amazon.com/general/latest/gr/api-retries.html) 
+  [Michael Nygard “Release It! Design and Deploy Production-Ready Software”](https://pragprog.com/titles/mnee2/release-it-second-edition/) 
+  [The Amazon Builders' Library: Avoiding fallback in distributed systems](https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems) 
+  [The Amazon Builders' Library: Avoiding insurmountable queue backlogs](https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs) 
+  [The Amazon Builders' Library: Caching challenges and strategies](https://aws.amazon.com/builders-library/caching-challenges-and-strategies/) 
+  [The Amazon Builders' Library: Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) 

 **Related videos:** 
+  [Retry, backoff, and jitter: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)](https://youtu.be/sKRdemSirDM?t=1884) 

# REL05-BP02 Throttle requests
<a name="rel_mitigate_interaction_failure_throttle_requests"></a>

Throttle requests to mitigate resource exhaustion due to unexpected increases in demand. Requests below throttling rates are processed while those over the defined limit are rejected with a return message indicating the request was throttled. 

 **Desired outcome:** Large volume spikes either from sudden customer traffic increases, flooding attacks, or retry storms are mitigated by request throttling, allowing workloads to continue normal processing of supported request volume. 

 **Common anti-patterns:** 
+  API endpoint throttles are not implemented or are left at default values without considering expected volumes. 
+  API endpoints are not load tested or throttling limits are not tested. 
+  Throttling request rates without considering request size or complexity. 
+  Testing maximum request rates or maximum request size, but not testing both together. 
+  Resources are not provisioned to the same limits established in testing. 
+  Usage plans have not been configured or considered for application to application (A2A) API consumers. 
+  Queue consumers that horizontally scale do not have maximum concurrency settings configured. 
+  Rate limiting on a per IP address basis has not been implemented. 

 **Benefits of establishing this best practice:** Workloads that set throttle limits are able to operate normally and process accepted request load successfully under unexpected volume spikes. Sudden or sustained spikes of requests to APIs and queues are throttled and do not exhaust request processing resources. Rate limits throttle individual requestors so that high volumes of traffic from a single IP address or API consumer will not exhaust resources or impact other consumers. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Services should be designed to process a known capacity of requests; this capacity can be established through load testing. If request arrival rates exceed limits, the appropriate response signals that a request has been throttled. This allows the consumer to handle the error and retry later. 

 When your service requires a throttling implementation, consider implementing the token bucket algorithm, where a token counts for a request. Tokens are refilled at a throttle rate per second and emptied asynchronously by one token per request. 

![\[Diagram describing the token bucket algorithm.\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/token-bucket-algorithm.png)
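
 A minimal sketch of the token bucket algorithm follows. The `TokenBucket` class and its `rate`, `capacity`, and `allow` names are assumptions for illustration. This sketch refills tokens lazily when a request arrives rather than on a background timer, which yields the same admission decisions.

```python
import time


class TokenBucket:
    """Admit a request only if a token is available; refill at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second (steady-state limit)
        self.capacity = capacity  # maximum tokens held (burst limit)
        self._clock = clock
        self._tokens = capacity
        self._last_refill = clock()

    def allow(self) -> bool:
        now = self._clock()
        elapsed = now - self._last_refill
        # Add the tokens earned since the last call, up to the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_refill = now
        if self._tokens >= 1:
            self._tokens -= 1  # one token per accepted request
            return True
        return False           # throttled: the caller can return a 429-style error
```

 The capacity sets how large a burst is accepted, and the rate sets the sustained request rate.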


 [Amazon API Gateway](https://aws.amazon.com/api-gateway/) implements the token bucket algorithm according to account and region limits and can be configured per-client with usage plans. Additionally, [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) and [Amazon Kinesis](https://aws.amazon.com/kinesis/) can buffer requests to smooth out the request rate, and allow higher throttling rates for requests that can be addressed. Finally, you can implement rate limiting with [AWS WAF](https://aws.amazon.com/waf/) to throttle specific API consumers that generate unusually high load. 

## Implementation steps
<a name="implementation-steps"></a>

 You can configure API Gateway with throttling limits for your APIs and return `429 Too Many Requests` errors when limits are exceeded. You can use AWS WAF with your AWS AppSync and API Gateway endpoints to enable rate limiting on a per IP address basis. Additionally, where your system can tolerate asynchronous processing, you can put messages into a queue or stream to speed up responses to service clients, which allows you to burst to higher throttle rates. 

 With asynchronous processing, when you’ve configured Amazon SQS as an event source for AWS Lambda, you can [configure maximum concurrency](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-max-concurrency) to avoid high event rates from consuming available account concurrent execution quota needed for other services in your workload or account. 

 While API Gateway provides a managed implementation of the token bucket, in cases where you cannot use API Gateway, you can take advantage of language specific open-source implementations (see related examples in Resources) of the token bucket for your services. 
+  Understand and configure [API Gateway throttling limits](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html) at the account level per region, API per stage, and API key per usage plan levels. 
+  Apply [AWS WAF rate limiting rules](https://aws.amazon.com/blogs/security/three-most-important-aws-waf-rate-based-rules/) to API Gateway and AWS AppSync endpoints to protect against floods and block malicious IPs. Rate limiting rules can also be configured on AWS AppSync API keys for A2A consumers. 
+  Consider whether you require more throttling control than rate limiting for AWS AppSync APIs, and if so, configure an API Gateway in front of your AWS AppSync endpoint. 
+  When Amazon SQS queues are set up as triggers for Lambda queue consumers, set [maximum concurrency](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-max-concurrency) to a value that processes enough to meet your service level objectives but does not consume concurrency limits impacting other Lambda functions. Consider setting reserved concurrency on other Lambda functions in the same account and region when you consume queues with Lambda. 
+  Use API Gateway with native service integrations to Amazon SQS or Kinesis to buffer requests. 
+  If you cannot use API Gateway, look at language specific libraries to implement the token bucket algorithm for your workload. Check the examples section and do your own research to find a suitable library. 
+  Test limits that you plan to set, or that you plan to allow to be increased, and document the tested limits. 
+  Do not increase limits beyond what you establish in testing. When increasing a limit, verify that provisioned resources are already equivalent to or greater than those in test scenarios before applying the increase. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL04-BP03 Do constant work](rel_prevent_interaction_failure_constant_work.md) 
+  [REL05-BP03 Control and limit retry calls](rel_mitigate_interaction_failure_limit_retries.md) 

 **Related documents:** 
+  [Amazon API Gateway: Throttle API Requests for Better Throughput](https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html) 
+  [AWS WAF: Rate-based rule statement](https://docs.aws.amazon.com/waf/latest/developerguide/waf-rule-statement-type-rate-based.html) 
+  [Introducing maximum concurrency of AWS Lambda when using Amazon SQS as an event source](https://aws.amazon.com/blogs/compute/introducing-maximum-concurrency-of-aws-lambda-functions-when-using-amazon-sqs-as-an-event-source/) 
+  [AWS Lambda: Maximum Concurrency](https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-max-concurrency) 

 **Related examples:** 
+  [The three most important AWS WAF rate-based rules](https://aws.amazon.com/blogs/security/three-most-important-aws-waf-rate-based-rules/) 
+  [Java Bucket4j](https://github.com/bucket4j/bucket4j) 
+  [Python token-bucket](https://pypi.org/project/token-bucket/) 
+  [Node token-bucket](https://www.npmjs.com/package/tokenbucket) 
+  [.NET System Threading Rate Limiting](https://www.nuget.org/packages/System.Threading.RateLimiting) 

 **Related videos:** 
+  [Implementing GraphQL API security best practices with AWS AppSync](https://www.youtube.com/watch?v=1ASMLeJ_15U) 

 **Related tools:** 
+  [Amazon API Gateway](https://aws.amazon.com/api-gateway/) 
+  [AWS AppSync](https://aws.amazon.com/appsync/) 
+  [Amazon SQS](https://aws.amazon.com/sqs/) 
+  [Amazon Kinesis](https://aws.amazon.com/kinesis/) 
+  [AWS WAF](https://aws.amazon.com/waf/) 
+  [Virtual Waiting Room on AWS](https://aws.amazon.com/solutions/implementations/virtual-waiting-room-on-aws/) 

# REL05-BP03 Control and limit retry calls
<a name="rel_mitigate_interaction_failure_limit_retries"></a>

Use exponential backoff to retry requests at progressively longer intervals between each retry. Introduce jitter between retries to randomize retry intervals. Limit the maximum number of retries.

 **Desired outcome:** Typical components in a distributed software system include servers, load balancers, databases, and DNS servers. During normal operation, these components can respond to requests with errors that are temporary or limited, and also errors that would be persistent regardless of retries. When clients make requests to services, the requests consume resources including memory, threads, connections, ports, or any other limited resources. Controlling and limiting retries is a strategy to release and minimize consumption of resources so that system components under strain are not overwhelmed. 

 When client requests time out or receive error responses, they should determine whether or not to retry. If they do retry, they do so with exponential backoff with jitter and a maximum retry value. As a result, backend services and processes are given relief from load and time to self-heal, resulting in faster recovery and successful request servicing. 

 **Common anti-patterns:** 
+  Implementing retries without adding exponential backoff, jitter, and maximum retry values. Backoff and jitter help avoid artificial traffic spikes due to unintentionally coordinated retries at common intervals. 
+  Implementing retries without testing their effects or assuming retries are already built into an SDK without testing retry scenarios. 
+  Failing to understand published error codes from dependencies, leading to retrying all errors, including those with a clear cause that indicates lack of permission, configuration error, or another condition that predictably will not resolve without manual intervention. 
+  Not addressing observability practices, including monitoring and alerting on repeated service failures so that underlying issues are made known and can be addressed. 
+  Developing custom retry mechanisms when built-in or third-party retry capabilities suffice. 
+  Retrying at multiple layers of your application stack in a manner that compounds retry attempts, further consuming resources in a retry storm. Be sure to understand how these errors affect your application and the dependencies you rely on, then implement retries at only one level. 
+  Retrying service calls that are not idempotent, causing unexpected side effects like duplicated results. 

 **Benefits of establishing this best practice:** Retries help clients acquire desired results when requests fail but also consume more of a server’s time to get the successful responses they want. When failures are rare or transient, retries work well. When failures are caused by resource overload, retries can make things worse. Adding exponential backoff with jitter to client retries allows servers to recover when failures are caused by resource overload. Jitter avoids alignment of requests into spikes, and backoff diminishes load escalation caused by adding retries to normal request load. Finally, it’s important to configure a maximum number of retries or elapsed time to avoid creating backlogs that produce metastable failures. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Control and limit retry calls. Use exponential backoff to retry after progressively longer intervals. Introduce jitter to randomize retry intervals and limit the maximum number of retries. 

 Some AWS SDKs implement retries and exponential backoff by default. Use these built-in AWS implementations where applicable in your workload. Implement similar logic in your workload when calling services that are idempotent and where retries improve your client availability. Decide what the timeouts are and when to stop retrying based on your use case. Build and exercise testing scenarios for those retry use cases. 
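
 For calls that the AWS SDKs do not cover, the retry logic can be sketched as follows. This is an illustrative sketch that uses the "full jitter" variant of exponential backoff. The names `TransientError`, `backoff_delays`, and `call_with_retries`, and the default values for `base`, `cap`, and `max_retries`, are assumptions to adapt to your use case.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: a random value up to min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, max_retries=3, sleep=time.sleep, **backoff):
    """Call func, retrying only TransientError, at most max_retries times."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return func()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the error to the caller
            sleep(delay)
```

 Only errors marked as transient are retried. Any other exception propagates immediately, which avoids retrying conditions such as missing permissions that will not resolve on their own.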

## Implementation steps
<a name="implementation-steps"></a>
+  Determine the optimal layer in your application stack to implement retries for the services your application relies on. 
+  Be aware of existing SDKs that implement proven retry strategies with exponential backoff and jitter for your language of choice, and favor these over writing your own retry implementations. 
+  Verify that [services are idempotent](https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/) before implementing retries. Once retries are implemented, be sure they are both tested and regularly exercised in production. 
+  When calling AWS service APIs, use the [AWS SDKs](https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html) and [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html) and understand the retry configuration options. Determine if the defaults work for your use case, test, and adjust as needed. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL04-BP04 Make mutating operations idempotent](rel_prevent_interaction_failure_idempotent.md) 
+  [REL05-BP02 Throttle requests](rel_mitigate_interaction_failure_throttle_requests.md) 
+  [REL05-BP04 Fail fast and limit queues](rel_mitigate_interaction_failure_fail_fast.md) 
+  [REL05-BP05 Set client timeouts](rel_mitigate_interaction_failure_client_timeouts.md) 
+  [REL11-BP01 Monitor all components of the workload to detect failures](rel_withstand_component_failures_monitoring_health.md) 

 **Related documents:** 
+  [Error Retries and Exponential Backoff in AWS](https://docs.aws.amazon.com/general/latest/gr/api-retries.html) 
+  [The Amazon Builders' Library: Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) 
+ [ Exponential Backoff and Jitter ](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/)
+ [ Making retries safe with idempotent APIs ](https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/)

 **Related examples:** 
+ [ Spring Retry ](https://github.com/spring-projects/spring-retry)
+ [ Resilience4j Retry ](https://resilience4j.readme.io/docs/retry)

 **Related videos:** 
+  [Retry, backoff, and jitter: AWS re:Invent 2019: Introducing The Amazon Builders’ Library (DOP328)](https://youtu.be/sKRdemSirDM?t=1884) 

 **Related tools:** 
+ [AWS SDKs and Tools: Retry behavior ](https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html)
+ [AWS Command Line Interface: AWS CLI retries ](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html)

# REL05-BP04 Fail fast and limit queues
<a name="rel_mitigate_interaction_failure_fail_fast"></a>

When a service is unable to respond successfully to a request, fail fast. This allows resources associated with a request to be released, and permits a service to recover if it’s running out of resources. Failing fast is a well-established software design pattern that can be leveraged to build highly reliable workloads in the cloud. Queuing is also a well-established enterprise integration pattern that can smooth load and allow clients to release resources when asynchronous processing can be tolerated. When a service is able to respond successfully under normal conditions but fails when the rate of requests is too high, use a queue to buffer requests. However, do not allow a buildup of long queue backlogs that can result in processing stale requests that a client has already given up on.

 **Desired outcome:** When systems experience resource contention, timeouts, exceptions, or grey failures that make service level objectives unachievable, fail fast strategies allow for faster system recovery. Systems that must absorb traffic spikes and can accommodate asynchronous processing can improve reliability by allowing clients to quickly release requests by using queues to buffer requests to backend services. When buffering requests to queues, queue management strategies are implemented to avoid insurmountable backlogs. 

 **Common anti-patterns:** 
+  Implementing message queues but not configuring dead letter queues (DLQ) or alarms on DLQ volumes to detect when a system is in failure. 
+  Not measuring the age of messages in a queue. Message age is a latency measurement that shows when queue consumers are falling behind or failing and retrying. 
+  Not clearing backlogged messages from a queue when the business need no longer exists and there is no value in processing them. 
+  Configuring first in first out (FIFO) queues when last in first out (LIFO) queues would better serve client needs, for example when strict ordering is not required and backlog processing is delaying all new and time sensitive requests resulting in all clients experiencing breached service levels. 
+  Exposing internal queues to clients instead of exposing APIs that manage work intake and place requests into internal queues. 
+  Combining too many work request types into a single queue which can exacerbate backlog conditions by spreading resource demand across request types. 
+  Processing complex and simple requests in the same queue, despite needing different monitoring, timeouts and resource allocations. 
+  Not validating inputs or using assertions to implement fail fast mechanisms in software that bubble up exceptions to higher level components that can handle errors gracefully. 
+  Not removing faulty resources from request routing, especially when failures are grey, emitting both successes and failures due to crashing and restarting, intermittent dependency failure, reduced capacity, or network packet loss. 

 **Benefits of establishing this best practice:** Systems that fail fast are easier to debug and fix, and often expose issues in coding and configuration before releases are published into production. Systems that incorporate effective queueing strategies provide greater resilience and reliability to traffic spikes and intermittent system fault conditions. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Fail fast strategies can be coded into software solutions as well as configured into infrastructure. In addition to failing fast, queues are a straightforward yet powerful architectural technique to decouple system components and smooth load. [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) provides capabilities to monitor for and alarm on failures. Once a system is known to be failing, mitigation strategies can be invoked, including failing away from impaired resources. When systems implement queues with [Amazon SQS](https://aws.amazon.com/sqs/) and other queue technologies to smooth load, they must consider how to manage queue backlogs, as well as message consumption failures. 
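
 As one way to picture failing fast in software, the following sketch shows a bounded in-process intake queue that validates input and rejects work immediately when it is full, instead of letting a backlog grow. The class and exception names (`BoundedIntake`, `Overloaded`) and the request shape (a dictionary with an `id` key) are assumptions for illustration only.

```python
import queue


class Overloaded(Exception):
    """Raised immediately when the intake queue is full (hypothetical name)."""


class BoundedIntake:
    """Accepts work into a fixed-size queue and rejects the rest right away."""

    def __init__(self, max_pending):
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, request):
        # Fail fast on bad input before any resources are committed.
        if not isinstance(request, dict) or "id" not in request:
            raise ValueError("request must be a dict with an 'id' key")
        try:
            # put_nowait never blocks: a full queue is reported at once.
            self._pending.put_nowait(request)
        except queue.Full:
            raise Overloaded("intake queue full; retry later") from None

    def take(self):
        """Remove and return the oldest pending request."""
        return self._pending.get_nowait()

    def depth(self):
        return self._pending.qsize()
```

 A caller that receives `Overloaded` can release its resources and back off, which is usually preferable to waiting on a queue that cannot drain in time.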

## Implementation steps
<a name="implementation-steps"></a>
+  Implement programmatic assertions or specific metrics in your software and use them to explicitly alert on system issues. Amazon CloudWatch helps you create metrics and alarms based on application log pattern and SDK instrumentation. 
+  Use CloudWatch metrics and alarms to fail away from impaired resources that are adding latency to processing or repeatedly failing to process requests. 
+  Use asynchronous processing by designing APIs to accept requests and append requests to internal queues using Amazon SQS and then respond to the message-producing client with a success message so the client can release resources and move on with other work while backend queue consumers process requests. 
+  Measure and monitor queue processing latency by producing a CloudWatch metric each time you take a message off a queue, comparing the current time to the message timestamp. 
+  When failures prevent successful message processing, or traffic spikes to volumes that cannot be processed within service level agreements, sideline older or excess traffic to a spillover queue. This allows priority processing of new work, with older work processed when capacity is available. This technique is an approximation of LIFO processing and allows normal system processing for all new work. 
+  Use dead letter or redrive queues to move messages that can’t be processed out of the backlog into a location that can be researched and resolved later. 
+  Either retry or, when tolerable, drop old messages by comparing the current time to the message timestamp and discarding messages that are no longer relevant to the requesting client. 
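
 The age-based steps above can be sketched as a small triage function that a queue consumer might run for each message. This is an illustrative sketch only: the `Message` type, the `triage` function, and the thresholds `fresh_within` and `expire_after` are assumptions, and real values depend on your service level objectives. The computed age is also the value you would publish as a queue latency metric.

```python
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    sent_at: float  # epoch seconds recorded when the producer enqueued the message


def message_age(message, now):
    """Return how long the message has waited, in seconds (never negative)."""
    return max(0.0, now - message.sent_at)


def triage(message, now, fresh_within=30.0, expire_after=300.0):
    """Decide what a consumer should do with a message based on its age.

    Returns "process" for fresh work, "sideline" for work that should move
    to a spillover queue, and "drop" for work the client has given up on.
    """
    age = message_age(message, now)
    if age <= fresh_within:
        return "process"
    if age <= expire_after:
        return "sideline"
    return "drop"
```

 Sidelining older messages lets new requests proceed at normal latency, which approximates LIFO processing while the backlog is worked off as capacity allows.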

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL04-BP02 Implement loosely coupled dependencies](rel_prevent_interaction_failure_loosely_coupled_system.md) 
+  [REL05-BP02 Throttle requests](rel_mitigate_interaction_failure_throttle_requests.md) 
+  [REL05-BP03 Control and limit retry calls](rel_mitigate_interaction_failure_limit_retries.md) 
+  [REL06-BP02 Define and calculate metrics (Aggregation)](rel_monitor_aws_resources_notification_aggregation.md) 
+  [REL06-BP07 Monitor end-to-end tracing of requests through your system](rel_monitor_aws_resources_end_to_end.md) 

 **Related documents:** 
+ [ Avoiding insurmountable queue backlogs ](https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs/)
+  [Fail Fast](https://www.martinfowler.com/ieeeSoftware/failFast.pdf) 
+ [ How can I prevent an increasing backlog of messages in my Amazon SQS queue? ](https://repost.aws/knowledge-center/sqs-message-backlog)
+ [ Elastic Load Balancing: Zonal Shift ](https://docs.aws.amazon.com/elasticloadbalancing/latest/network/zonal-shift.html)
+ [ Amazon Application Recovery Controller: Routing control for traffic failover ](https://docs.aws.amazon.com/r53recovery/latest/dg/getting-started-routing-controls.html)

 **Related examples:** 
+ [ Enterprise Integration Patterns: Dead Letter Channel ](https://www.enterpriseintegrationpatterns.com/patterns/messaging/DeadLetterChannel.html)

 **Related videos:** 
+  [AWS re:Invent 2022 - Operating highly available Multi-AZ applications](https://www.youtube.com/watch?v=mwUV5skJJ0s) 

 **Related tools:** 
+ [ Amazon SQS ](https://aws.amazon.com/sqs/)
+ [ Amazon MQ ](https://aws.amazon.com/amazon-mq/)
+ [AWS IoT Core](https://aws.amazon.com/iot-core/)
+ [ Amazon CloudWatch ](https://aws.amazon.com/cloudwatch/)

# REL05-BP05 Set client timeouts
<a name="rel_mitigate_interaction_failure_client_timeouts"></a>

Set timeouts appropriately on connections and requests, verify them systematically, and do not rely on default values as they are not aware of workload specifics.

 **Desired outcome:** Client timeouts should consider the cost to the client, server, and workload associated with waiting for requests that take abnormal amounts of time to complete. Since it is not possible to know the exact cause of any timeout, clients must use knowledge of services to develop expectations of probable causes and appropriate timeouts. 

 Client connections time out based on configured values. After encountering a timeout, clients make decisions to back off and retry or open a [circuit breaker](https://martinfowler.com/bliki/CircuitBreaker.html). These patterns avoid issuing requests that may exacerbate an underlying error condition. 

 **Common anti-patterns:** 
+  Not being aware of system timeouts or default timeouts. 
+  Not being aware of normal request completion timing. 
+  Not being aware of possible causes for requests to take abnormally long to complete, or the costs to client, service, or workload performance associated with waiting on these completions. 
+  Not being aware of the probability of an impaired network causing a request to fail only once the timeout is reached, and the costs to client and workload performance for not adopting a shorter timeout. 
+  Not testing timeout scenarios both for connections and requests. 
+  Setting timeouts too high, which can result in long wait times and increase resource utilization. 
+  Setting timeouts too low, resulting in artificial failures. 
+  Overlooking patterns to deal with timeout errors for remote calls like circuit breakers and retries. 
+  Not considering monitoring for service call error rates, service level objectives for latency, and latency outliers. These metrics can provide insight into aggressive or permissive timeouts. 

 **Benefits of establishing this best practice:** Remote call timeouts are configured, and systems are designed to handle timeouts gracefully, so that resources are conserved when remote calls respond abnormally slowly. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Set both a connection timeout and a request timeout on any service dependency call and generally on any call across processes. Many frameworks offer built-in timeout capabilities, but be careful, as some have default values that are infinite or higher than acceptable for your service goals. A value that is too high reduces the usefulness of the timeout because resources continue to be consumed while the client waits for the timeout to occur. A value that is too low can generate increased traffic on the backend and increased latency because too many requests are retried. In some cases, this can lead to complete outages because all requests are being retried. 

 Consider the following when determining timeout strategies: 
+  Requests may take longer than normal to process because of their content, impairments in a target service, or a networking partition failure. 
+  Requests with abnormally expensive content could consume unnecessary server and client resources. In this case, timing out these requests and not retrying can preserve resources. Services should also protect themselves from abnormally expensive content with throttles and server-side timeouts. 
+  Requests that take abnormally long due to a service impairment can be timed out and retried. Consideration should be given to service costs for the request and retry, but if the cause is a localized impairment, a retry is not likely to be expensive and will reduce client resource consumption. The timeout may also release server resources depending on the nature of the impairment. 
+  Requests that take a long time to complete because the request or response has failed to be delivered by the network can be timed out and retried. Because the request or response was not delivered, failure would have been the outcome regardless of the length of timeout. Timing out in this case will not release server resources, but it will release client resources and improve workload performance. 

 Take advantage of well-established design patterns like retries and circuit breakers to handle timeouts gracefully and support fail-fast approaches. [AWS SDKs](https://docs.aws.amazon.com/index.html#sdks) and [AWS CLI](https://aws.amazon.com/cli/) allow for configuration of both connection and request timeouts and for retries with exponential backoff and jitter. [AWS Lambda](https://aws.amazon.com/lambda/) functions support configuration of timeouts, and with [AWS Step Functions](https://aws.amazon.com/step-functions/), you can build low code circuit breakers that take advantage of pre-built integrations with AWS services and SDKs. [AWS App Mesh](https://aws.amazon.com/app-mesh/) Envoy provides timeout and circuit breaker capabilities. 
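
 To make the distinction between connection and request timeouts concrete, here is a minimal sketch using only the Python standard library. The function names `connect_with_timeouts` and `send_and_receive` and the default timeout values are assumptions for illustration. AWS SDKs expose their own settings for the same two concepts, so use those when calling AWS services.

```python
import socket


def connect_with_timeouts(host, port, connect_timeout=1.0, read_timeout=2.0):
    """Open a TCP connection with separate connect and read timeouts."""
    # The connect timeout bounds how long we wait to establish the connection.
    conn = socket.create_connection((host, port), timeout=connect_timeout)
    # The read timeout bounds each later blocking send or receive call.
    conn.settimeout(read_timeout)
    return conn


def send_and_receive(conn, payload, max_bytes=1024):
    """Send a request and wait for a reply, raising socket.timeout if none arrives."""
    conn.sendall(payload)
    return conn.recv(max_bytes)
```

 When `send_and_receive` raises `socket.timeout`, the caller can release the connection and decide whether to retry with backoff or to open a circuit breaker, rather than waiting indefinitely.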

## Implementation steps
<a name="implementation-steps"></a>
+  Configure timeouts on remote service calls and take advantage of built-in language timeout features or open source timeout libraries. 
+  When your workload makes calls with an AWS SDK, review the documentation for language specific timeout configuration. 
  + [ Python ](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html)
  + [ PHP ](https://docs.aws.amazon.com/aws-sdk-php/v3/api/class-Aws.DefaultsMode.Configuration.html)
  + [ .NET ](https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/retries-timeouts.html)
  + [ Ruby ](https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/timeout-duration.html)
  + [ Java ](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/best-practices.html#bestpractice5)
  + [ Go ](https://aws.github.io/aws-sdk-go-v2/docs/configuring-sdk/retries-timeouts/#timeouts)
  + [ Node.js ](https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Config.html)
  + [ C++ ](https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/client-config.html)
+  When using AWS SDKs or AWS CLI commands in your workload, configure default timeout values by setting the AWS [configuration defaults](https://docs.aws.amazon.com/sdkref/latest/guide/feature-smart-config-defaults.html) for `connectTimeoutInMillis` and `tlsNegotiationTimeoutInMillis`. 
+  Apply [command line options](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html) `cli-connect-timeout` and `cli-read-timeout` to control one-off AWS CLI commands to AWS services. 
+  Monitor remote service calls for timeouts, and set alarms on persistent errors so that you can proactively handle error scenarios. 
+  Implement [CloudWatch Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/working_with_metrics.html) and [CloudWatch anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) on call error rates, service level objectives for latency, and latency outliers to provide insight into managing overly aggressive or permissive timeouts. 
+  Configure timeouts on [Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html#configuration-timeout-console). 
+  API Gateway clients must implement their own retries when handling timeouts. API Gateway supports a [50 millisecond to 29 second integration timeout](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#api-gateway-execution-service-limits-table) for downstream integrations and does not retry when integration requests time out. 
+  Implement the [circuit breaker](https://martinfowler.com/bliki/CircuitBreaker.html) pattern to avoid making remote calls when they are timing out. Open the circuit to avoid failing calls and close the circuit when calls are responding normally. 
+  For container-based workloads, review [App Mesh Envoy](https://docs.aws.amazon.com/app-mesh/latest/userguide/envoy.html) features to leverage built-in timeouts and circuit breakers. 
+  Use AWS Step Functions to build low code circuit breakers for remote service calls, especially when calling AWS SDKs and supported Step Functions integrations, to simplify your workload. 
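
 The circuit breaker step above can be sketched in a few lines. This is a simplified, single-threaded illustration rather than a production implementation: the names `CircuitBreaker` and `CircuitOpenError`, and the defaults for `failure_threshold` and `reset_after`, are assumptions. The breaker opens after consecutive failures, rejects calls without contacting the dependency while open, and allows a trial call once the cool-down has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let this call through as a trial (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

 The `clock` parameter is injectable so that opening and closing can be tested without waiting. A shared breaker in a multi-threaded service would also need locking, which this sketch omits.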

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL05-BP03 Control and limit retry calls](rel_mitigate_interaction_failure_limit_retries.md) 
+  [REL05-BP04 Fail fast and limit queues](rel_mitigate_interaction_failure_fail_fast.md) 
+  [REL06-BP07 Monitor end-to-end tracing of requests through your system](rel_monitor_aws_resources_end_to_end.md) 

 **Related documents:** 
+  [AWS SDK: Retries and Timeouts](https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/retries-timeouts.html) 
+  [The Amazon Builders' Library: Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) 
+ [ Amazon API Gateway quotas and important notes ](https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html)
+ [AWS Command Line Interface: Command line options ](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-options.html)
+ [AWS SDK for Java 2.x: Configure API Timeouts ](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/best-practices.html#bestpractice5)
+ [AWS Botocore using the config object and Config Reference ](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-the-config-object)
+ [AWS SDK for .NET: Retries and Timeouts ](https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/retries-timeouts.html)
+ [AWS Lambda: Configuring Lambda function options ](https://docs.aws.amazon.com/lambda/latest/dg/configuration-function-common.html)

 **Related examples:** 
+ [ Using the circuit breaker pattern with AWS Step Functions and Amazon DynamoDB ](https://aws.amazon.com/blogs/compute/using-the-circuit-breaker-pattern-with-aws-step-functions-and-amazon-dynamodb/)
+ [ Martin Fowler: CircuitBreaker ](https://martinfowler.com/bliki/CircuitBreaker.html?ref=wellarchitected)

 **Related tools:** 
+ [AWS SDKs ](https://docs.aws.amazon.com/index.html#sdks)
+ [AWS Lambda](https://aws.amazon.com/lambda/)
+ [ Amazon SQS ](https://aws.amazon.com/sqs/)
+ [AWS Step Functions](https://aws.amazon.com/step-functions/)
+ [AWS Command Line Interface](https://aws.amazon.com/cli/)

# REL05-BP06 Make systems stateless where possible
<a name="rel_mitigate_interaction_failure_stateless"></a>

 Systems should either not require state, or should offload state such that between different client requests, there is no dependence on locally stored data on disk and in memory. This allows servers to be replaced at will without causing an availability impact. 

 When users or services interact with an application, they often perform a series of interactions that form a session. A session is unique data for users that persists between requests while they use the application. A stateless application is an application that does not need knowledge of previous interactions and does not store session information. 

 Once your application is designed to be stateless, you can use serverless compute services, such as AWS Lambda or AWS Fargate. 

 In addition to server replacement, another benefit of stateless applications is that they can scale horizontally because any of the available compute resources (such as EC2 instances and AWS Lambda functions) can service any request. 

 **Benefits of establishing this best practice:** Systems that are designed to be stateless are more adaptable to horizontal scaling, making it possible to add or remove capacity based on fluctuating traffic and demand. They are also inherently resilient to failures and provide flexibility and agility in application development. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Make your applications stateless. Stateless applications allow horizontal scaling and are tolerant to the failure of an individual node. Analyze and understand the components of your application that maintain state within the architecture. This helps you assess the potential impact of transitioning to a stateless design. A stateless architecture decouples user data and offloads the session data. This provides the flexibility to scale each component independently to meet varying workload demands and optimize resource utilization. 

### Implementation steps
<a name="implementation-steps"></a>
+  Identify and understand the stateful components in your application. 
+  Decouple data by separating and managing user data from the core application logic. 
  +  [Amazon Cognito](https://aws.amazon.com/cognito/) can decouple user data from application code by using features, such as [identity pools](https://docs.aws.amazon.com/cognito/latest/developerguide/getting-started-with-identity-pools.html), [user pools](https://docs.aws.amazon.com/cognito/latest/developerguide/getting-started-with-cognito-user-pools.html), and [Amazon Cognito Sync](https://docs.aws.amazon.com/cognito/latest/developerguide/cognito-sync.html). 
  +  You can use [AWS Secrets Manager](https://aws.amazon.com/secrets-manager/) to decouple user data by storing secrets in a secure, centralized location. This means that the application code doesn't need to store secrets, which makes it more secure. 
  +  Consider using [Amazon S3](https://aws.amazon.com/s3/) to store large, unstructured data, such as images and documents. Your application can retrieve this data when required, eliminating the need to store it in memory. 
  +  Use [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) to store information such as user profiles. Your application can query this data in near-real time. 
+  Offload session data to a database, cache, or external files. 
  +  [Amazon ElastiCache](https://aws.amazon.com/elasticache/), Amazon DynamoDB, [Amazon Elastic File System](https://aws.amazon.com/efs/) (Amazon EFS), and [Amazon MemoryDB](https://aws.amazon.com/memorydb/) are examples of AWS services that you can use to offload session data. 
+  Design a stateless architecture after you identify which state and user data need to be persisted with your storage solution of choice. 
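
 The following sketch illustrates the idea of offloading session data so that any server can handle any request. It is an assumption-laden illustration: `SessionStore` is an in-memory stand-in for an external store such as a cache or database, and `Server` and `add_to_cart` are hypothetical names. The point is that the server object holds no per-session fields, so it can be replaced at will.

```python
import json


class SessionStore:
    """In-memory stand-in for an external session store (cache or database)."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        return self._items.get(key)

    def put(self, key, value):
        self._items[key] = value


class Server:
    """A request handler that keeps no session data in its own memory."""

    def __init__(self, name, store):
        self.name = name
        self._store = store  # the only dependency; no per-session fields

    def add_to_cart(self, session_id, item):
        # Load the session from the shared store on every request.
        raw = self._store.get(session_id)
        cart = json.loads(raw) if raw else []
        cart.append(item)
        # Write the updated session back before responding.
        self._store.put(session_id, json.dumps(cart))
        return cart
```

 Because the read and write are separate calls, concurrent requests for the same session could overwrite each other. A real store would typically need an atomic or conditional update, which this sketch leaves out.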

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL11-BP03 Automate healing on all layers](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_auto_healing_system.html) 

 **Related documents:** 
+  [The Amazon Builders' Library: Avoiding fallback in distributed systems](https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems) 
+  [The Amazon Builders' Library: Avoiding insurmountable queue backlogs](https://aws.amazon.com/builders-library/avoiding-insurmountable-queue-backlogs) 
+  [The Amazon Builders' Library: Caching challenges and strategies](https://aws.amazon.com/builders-library/caching-challenges-and-strategies/) 
+  [Best Practices for Stateless Web Tier on AWS](https://docs.aws.amazon.com/whitepapers/latest/best-practices-wordpress/stateless-web-tier.html) 

# REL05-BP07 Implement emergency levers
<a name="rel_mitigate_interaction_failure_emergency_levers"></a>

 Emergency levers are rapid processes that can mitigate availability impact on your workload. 

 Emergency levers work by disabling, throttling, or changing the behavior of components or dependencies using known and tested mechanisms. This can alleviate workload impairments caused by resource exhaustion due to unexpected increases in demand and reduce the impact of failures in non-critical components within your workload. 

 **Desired outcome:** By implementing emergency levers, you can establish known-good processes to maintain the availability of critical components in your workload. The workload should degrade gracefully and continue to perform its business-critical functions during the activation of an emergency lever. For more detail on graceful degradation, see [REL05-BP01 Implement graceful degradation to transform applicable hard dependencies into soft dependencies](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_mitigate_interaction_failure_graceful_degradation.html). 

 **Common anti-patterns:** 
+  Failure of non-critical dependencies impacts the availability of your core workload. 
+  Not testing or verifying critical component behavior during non-critical component impairment. 
+  No clear and deterministic criteria defined for activation or deactivation of an emergency lever. 

 **Benefits of establishing this best practice:** Implementing emergency levers can improve the availability of the critical components in your workload by providing your resolvers with established processes to respond to unexpected spikes in demand or failures of non-critical dependencies. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>
+  Identify critical components in your workload. 
+  Design and architect the critical components in your workload to withstand failure of non-critical components. 
+  Conduct testing to validate the behavior of your critical components during the failure of non-critical components. 
+  Define and monitor relevant metrics or triggers to initiate emergency lever procedures. 
+  Define the procedures (manual or automated) that comprise the emergency lever. 

### Implementation steps
<a name="implementation-steps"></a>
+  Identify business-critical components in your workload. 
  +  Each technical component in your workload should be mapped to its relevant business function and ranked as critical or non-critical. For examples of critical and non-critical functionality at Amazon, see [Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over 84K Requests Per Second](https://community.aws/posts/how-search-uses-chaos-engineering). 
  +  This is both a technical and business decision, and varies by organization and workload. 
+  Design and architect the critical components in your workload to withstand failure of non-critical components. 
  +  During dependency analysis, consider all potential failure modes, and verify that your emergency lever mechanisms deliver the critical functionality to downstream components. 
+  Conduct testing to validate the behavior of your critical components during activation of your emergency levers. 
  +  Avoid bimodal behavior. For more detail, see [REL11-BP05 Use static stability to prevent bimodal behavior](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_static_stability.html). 
+  Define, monitor, and alert on relevant metrics to initiate the emergency lever procedure. 
  +  Finding the right metrics to monitor depends on your workload. Some example metrics are latency or the number of failed requests to a dependency. 
+  Define the procedures, manual or automated, that comprise the emergency lever. 
  +  This may include mechanisms such as [load shedding](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/), [throttling requests](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_mitigate_interaction_failure_throttle_requests.html), or implementing [graceful degradation](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_mitigate_interaction_failure_graceful_degradation.html). 
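
 As a rough illustration of these steps, the sketch below shows a lever that turns off a non-critical dependency (product reviews) while the critical path (title and price) keeps working, together with a deterministic activation rule. All names here (`EmergencyLevers`, `should_engage`, `build_detail_page`, the lever name `disable_reviews`) and the threshold values are assumptions for the example, not a prescribed design.

```python
class EmergencyLevers:
    """Tracks which levers are currently engaged (hypothetical helper)."""

    def __init__(self):
        self._engaged = set()

    def engage(self, name):
        self._engaged.add(name)

    def release(self, name):
        self._engaged.discard(name)

    def is_engaged(self, name):
        return name in self._engaged


def should_engage(failed, total, max_error_rate=0.5, min_requests=20):
    """Deterministic activation rule: enough traffic and too many failures."""
    return total >= min_requests and failed / total > max_error_rate


def build_detail_page(product, levers, fetch_reviews):
    """Build a page whose critical fields never depend on the reviews service."""
    page = {"title": product["title"], "price": product["price"]}
    if levers.is_engaged("disable_reviews"):
        # Lever engaged: skip the non-critical call entirely.
        page["reviews"] = []
        return page
    page["reviews"] = fetch_reviews(product["id"])
    return page
```

 Exercising both paths regularly, with the lever engaged and released, helps confirm that the degraded path is known to work before it is needed.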

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [REL05-BP01 Implement graceful degradation to transform applicable hard dependencies into soft dependencies](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_mitigate_interaction_failure_graceful_degradation.html) 
+  [REL05-BP02 Throttle requests](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_mitigate_interaction_failure_throttle_requests.html) 
+  [REL11-BP05 Use static stability to prevent bimodal behavior](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/rel_withstand_component_failures_static_stability.html) 

 **Related documents:** 
+ [ Automating safe, hands-off deployments ](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/)
+  [Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over 84K Requests Per Second](https://community.aws/posts/how-search-uses-chaos-engineering) 

 **Related videos:** 
+ [AWS re:Invent 2020: Reliability, consistency, and confidence through immutability](https://www.youtube.com/watch?v=jUSYnRztttY)