

 This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

# Designing highly available distributed systems on AWS
<a name="designing-highly-available-distributed-systems-on-aws"></a>

 The previous sections have been mostly about the theoretical availability of workloads and what they can achieve. They are an important set of concepts to keep in mind as you build distributed systems. They will help inform your dependency selection process and how you implement redundancy. 

 We’ve also looked at the relationship of MTTD, MTTR, and MTBF to availability. This section will introduce practical guidance based on the previous theory. In short, engineering workloads for high availability aims to increase the MTBF and decrease the MTTR as well as the MTTD. 

 Although eliminating all failures would be ideal, it's not realistic. In large distributed systems with deeply stacked dependencies, failures are going to occur. “Everything fails all of the time” (see Werner Vogels, CTO, Amazon.com, [10 Lessons from 10 Years of Amazon Web Services](https://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html).) and “you can’t legislate against failure [so] focus on fast detection and response.” (see Chris Pinkham, founding member, Amazon EC2 team, [ARC335 Designing for failure: Architecting resilient systems on AWS](https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Designing_for_failure_Architecting_resilient_systems_on_AWS_ARC335-R1.pdf)) 

 What this means is that frequently you don't have control over whether failure happens. What you can control is how quickly you detect a failure and do something about it. So, while increasing MTBF is still an important component of high availability, the most significant levers customers have within their control are reducing MTTD and MTTR. 

**Topics**
+ [Reducing MTTD](reducing-mttd.md)
+ [Reducing MTTR](reducing-mttr.md)
+ [Increasing MTBF](increasing-mtbf.md)

# Reducing MTTD
<a name="reducing-mttd"></a>

 Reducing the MTTD of a failure means discovering the failure as quickly as possible. Shortening the MTTD is based on observability, or how you've instrumented your workload to understand its state. You should monitor the *Customer Experience* metrics in your workload's critical subsystems to proactively identify when a problem occurs (refer to [Appendix 1 – MTTD and MTTR critical metrics](appendix-1-mttd-and-mttr-critical-metrics.md) for more information about these metrics). You can use [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to create *canaries* that monitor your APIs and consoles to proactively measure the user experience. There are a number of other health check mechanisms that can be used to minimize the MTTD, such as [Elastic Load Balancing (ELB) health checks](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-add-elb-healthcheck.html) and [Amazon Route 53 health checks](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/health-checks-types.html). (See [Amazon Builders' Library – Implementing health checks](https://aws.amazon.com/builders-library/implementing-health-checks/).) 

 Your monitoring also needs to be able to detect partial failures, both of the system as a whole and of your individual subsystems. Your availability, failure, and latency metrics should use the dimensionality of your fault isolation boundaries as [CloudWatch metric dimensions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_concepts.html#Dimension). For example, consider a single EC2 instance in the use1-az1 AZ of the us-east-1 Region, part of a cell-based architecture serving the workload's update API within its control plane subsystem. When the server pushes its metrics, it can use its instance ID, AZ, Region, API name, and subsystem name as dimensions. This allows you to observe and alarm across each of these dimensions to detect failure. 
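As a sketch of this approach, the snippet below builds a CloudWatch metric datum tagged with each fault-isolation dimension. The namespace, metric name, and dimension names are illustrative choices, not prescribed values:

```python
def build_latency_datum(value_ms, instance_id, az, region, api, subsystem):
    """Build a CloudWatch metric datum whose dimensions mirror the
    workload's fault isolation boundaries, so alarms can be set per
    instance, per AZ, per Region, per API, and per subsystem."""
    return {
        "MetricName": "RequestLatency",  # illustrative metric name
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "AvailabilityZone", "Value": az},
            {"Name": "Region", "Value": region},
            {"Name": "Api", "Value": api},
            {"Name": "Subsystem", "Value": subsystem},
        ],
        "Value": value_ms,
        "Unit": "Milliseconds",
    }

datum = build_latency_datum(42.0, "i-0123456789abcdef0", "use1-az1",
                            "us-east-1", "update", "control-plane")

# To publish (requires AWS credentials and the boto3 SDK):
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="MyWorkload", MetricData=[datum])
```

With this shape, a single CloudWatch alarm can watch the `AvailabilityZone` dimension to detect an AZ-localized failure, independent of instance-level alarms.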

# Reducing MTTR
<a name="reducing-mttr"></a>

 After a failure is discovered, the remainder of the MTTR time is the actual repair or mitigation of impact. To repair or mitigate a failure, you have to know what's wrong. There are two key groups of metrics that provide insight during this phase: 1/*Impact Assessment* metrics and 2/*Operational Health* metrics. The first group tells you the scope of impact during a failure, measuring the number or percentage of the customers, resources, or workloads impacted. The second group helps identify *why* there is impact. After the why is discovered, operators and automation can respond to and resolve the failure. Refer to [Appendix 1 – MTTD and MTTR critical metrics](appendix-1-mttd-and-mttr-critical-metrics.md) for more information about these metrics. 

**Rule 9**  
Observability and instrumentation are critical for reducing MTTD and MTTR. 

## Route around failure
<a name="route-around-failure"></a>

 The fastest approach to mitigating impact is through fail-fast subsystems that route around failure. This approach uses redundancy to reduce MTTR by quickly shifting the work of a failed subsystem to a spare. The redundancy can range from software processes, to EC2 instances, to multiple AZs, to multiple Regions. 

 Spare subsystems can reduce the MTTR to almost zero. The recovery time is only what it takes to reroute the work to the standby spare. This often happens with minimal latency and allows the work to complete within the defined SLA, maintaining the availability of the system. This produces MTTRs that are experienced as slight, perhaps even imperceptible, delays rather than prolonged periods of unavailability. 

 For example, if your service utilizes EC2 instances behind an Application Load Balancer (ALB), you can configure health checks at an interval as small as five seconds and require only two failed health checks before a target is marked as unhealthy. This means that within 10 seconds, you can detect a failure and stop sending traffic to the unhealthy host. In this case, the MTTR is effectively the same as the MTTD since as soon as the failure is detected, it is also mitigated. 
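As a hedged sketch, these are the kinds of parameters you might pass to the Elastic Load Balancing `modify_target_group` API (shown here via boto3's `elbv2` client) to get that 10-second detection window; the health check path is a hypothetical example:

```python
def fast_fail_health_check(target_group_arn):
    """Request parameters for elbv2.modify_target_group that mark a
    target unhealthy within ~10 seconds: two failed checks at a
    five-second interval."""
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckIntervalSeconds": 5,
        "HealthCheckTimeoutSeconds": 4,   # must be shorter than the interval
        "UnhealthyThresholdCount": 2,
        "HealthyThresholdCount": 2,
        "HealthCheckPath": "/healthz",    # hypothetical shallow health check path
    }

params = fast_fail_health_check(
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example/abc123")

# Worst-case detection time = interval * failed checks before unhealthy
detection_seconds = (params["HealthCheckIntervalSeconds"]
                     * params["UnhealthyThresholdCount"])

# To apply (requires AWS credentials and the boto3 SDK):
#   import boto3
#   boto3.client("elbv2").modify_target_group(**params)
```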

 This is what *high-availability* or *continuous-availability* workloads are trying to achieve. We want to route around failure quickly by detecting failed subsystems, marking them as failed, stopping traffic to them, and sending traffic to a redundant subsystem instead. 

 Note that using this kind of fail-fast mechanism makes your workload very sensitive to transient errors. In the example provided, ensure that your load balancer health checks are performing *shallow* health checks of just the instance, not testing dependencies or workflows (often referred to as *deep* health checks). (See [Amazon Builders' Library – Implementing health checks](https://aws.amazon.com/builders-library/implementing-health-checks/).) This will help prevent unnecessary replacement of instances during transient errors affecting the workload. 
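A minimal illustration of the difference between the two kinds of checks, assuming a Python service; the function names are illustrative:

```python
def shallow_health_check():
    """Liveness only: confirms the process can serve a response. It does
    NOT touch the database or other dependencies, so a transient
    dependency error won't cause the load balancer to replace the host."""
    return 200, "ok"

def deep_health_check(database_reachable):
    """Shown for contrast: a deep check couples the host's health to a
    dependency, which can trigger mass host replacement during a
    transient dependency error."""
    if database_reachable():
        return 200, "ok"
    return 503, "dependency check failed"
```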

 Observability and the ability to detect failure in subsystems is critical for routing around failure to be successful. You have to know the scope of impact so the affected resources can be marked as unhealthy or failed and taken out of service so they can be routed around. For example, if a single AZ experiences a partial service impairment, your instrumentation will need to be able to identify that there is an AZ-localized issue to route around all resources in that AZ until it has recovered. 

 Being able to route around failure might also require additional tooling depending on the environment. Using the previous example with EC2 instances behind an ALB, imagine that instances in one AZ might be passing local health checks, but an isolated AZ impairment is causing them to fail to connect to their database in a different AZ. In this case, the load balancing health checks won’t take those instances out of service. A different automated mechanism would be needed to [remove the AZ from the load balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-subnets.html) or force the instances to fail their health checks, which depends on identifying that the scope of impact is an AZ. For workloads that aren’t using a load balancer, a similar method would be needed to prevent resources in a specific AZ from accepting units of work or removing capacity from the AZ altogether. 

 In some cases, the shift of work to a redundant subsystem can't be automated, such as failing over from a primary to a secondary database when the technology doesn't provide its own leader election. This is a common scenario for [AWS multi-Region architectures](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/). Because these types of failovers require some amount of downtime to accomplish, can't be immediately reversed, and leave the workload without redundancy for a period of time, it's important to have a human in the decision-making process. 

 Workloads that can embrace a less strict consistency model can achieve shorter MTTRs by using multi-Region failover automation to route around failure. Features like [Amazon S3 cross-Region replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) or [Amazon DynamoDB global tables](https://aws.amazon.com/dynamodb/global-tables/) provide multi-Region capabilities through eventually consistent replication. Furthermore, using a relaxed consistency model is beneficial when we consider the CAP theorem. During network failures that impact connectivity to stateful subsystems, if the workload chooses availability over consistency, it can still provide non-error responses, another way of routing around failure. 

 Routing around failure can be implemented with two different strategies. The first is static stability: pre-provision enough resources to handle the complete load of the failed subsystem, whether that's a single EC2 instance or an entire AZ's worth of capacity. This avoids provisioning new resources during a failure, which would increase the MTTR and add a control plane dependency to your recovery path. However, static stability comes at additional cost. 

 The second strategy is to route some of the traffic from the failed subsystem to others and [load shed the excess traffic](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload) that cannot be handled by the remaining capacity. During this period of degradation, you can scale up new resources to replace the failed capacity. This approach has a longer MTTR and creates a dependency on a control plane, but costs less in standby, spare capacity. 
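One simple way to sketch load shedding, assuming a concurrency limit is the capacity signal (real systems may shed on queue depth, latency, or utilization instead):

```python
import threading

class LoadShedder:
    """Admit at most max_in_flight concurrent units of work; excess work
    is shed (the caller should return a retryable error such as HTTP 503)
    so the remaining capacity stays within its limits while replacement
    resources scale up in the background."""
    def __init__(self, max_in_flight):
        self._slots = threading.Semaphore(max_in_flight)

    def try_handle(self, work):
        """Run work() if a slot is free; return None if the request was shed."""
        if not self._slots.acquire(blocking=False):
            return None
        try:
            return work()
        finally:
            self._slots.release()
```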

## Return to a known good state
<a name="return-to-a-known-good-state"></a>

 Another common approach for mitigation during repair is returning the workload to a previous known good state. If a recent change might have caused the failure, rolling back that change is one way to return to the previous state. 

 In other cases, transient conditions might have caused the failure, in which case, restarting the workload might mitigate the impact. Let's examine both of these scenarios. 

 During a deployment, minimizing the MTTD and MTTR relies on observability and automation. Your deployment process must continually watch the workload for the introduction of increased error rates, increased latency, or anomalies. After these are recognized, it should halt the deployment process. 
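A minimal sketch of such a deployment gate; the thresholds are illustrative assumptions, not recommendations:

```python
def should_halt_deployment(baseline_error_rate, observed_error_rate,
                           baseline_p99_ms, observed_p99_ms,
                           error_tolerance=0.005, latency_tolerance=1.2):
    """Gate each deployment stage on workload health: halt (and trigger
    rollback) if the error rate or p99 latency regresses past the
    tolerance relative to the pre-deployment baseline."""
    if observed_error_rate > baseline_error_rate + error_tolerance:
        return True
    if observed_p99_ms > baseline_p99_ms * latency_tolerance:
        return True
    return False
```

In practice this comparison would run continuously against CloudWatch metrics between deployment waves, halting automatically rather than waiting for an operator.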

 There are various [deployment strategies](https://docs.aws.amazon.com/whitepapers/latest/overview-deployment-options/deployment-strategies.html), like in-place deployments, blue/green deployments, and rolling deployments. Each one of these might use a different mechanism to return to a known-good state. It can automatically roll back to the previous state, shift traffic back to the blue environment, or require manual intervention. 

 CloudFormation [offers the capability to automatically roll back](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-rollback-triggers.html) as part of its create and update stack operations, as does [AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployments-rollback-and-redeploy.html#deployments-rollback-and-redeploy-automatic-rollbacks). CodeDeploy also supports blue/green and rolling deployments. 

 To take advantage of these capabilities and minimize your MTTR, consider automating all of your infrastructure and code deployments through these services. In scenarios where you cannot use these services, consider implementing the [saga pattern](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/implement-the-serverless-saga-pattern-by-using-aws-step-functions.html) with AWS Step Functions to roll back failed deployments. 

 When considering *restart*, there are several different approaches, ranging from rebooting a server (the longest task) to restarting a thread (the shortest). The following table outlines some restart approaches and approximate times to complete. The times are representative of order-of-magnitude differences, not exact values. 

 


|  Fault recovery mechanism  |  Estimated MTTR  | 
| --- | --- | 
|  Launch and configure new virtual server  |  15 minutes  | 
|  Redeploy the software  |  10 minutes  | 
|  Reboot server  |  5 minutes  | 
|  Restart or launch container  |  2 seconds  | 
|  Invoke new serverless function  |  100 ms  | 
|  Restart process  |  10 ms  | 
|  Restart thread  |  10 μs  | 

 As the table shows, there are clear MTTR benefits to using containers and serverless functions (like [AWS Lambda](https://aws.amazon.com/lambda/)). Their MTTR is orders of magnitude faster than restarting a virtual machine or launching a new one. However, using fault isolation through software modularity is also beneficial. If you can contain failure to a single process or thread, recovering from that failure is much faster than restarting a container or server. 

 As a general approach to recovery, you can move from bottom to top: 1/Restart, 2/Reboot, 3/Re-image/Redeploy, 4/Replace. However, once you get to the reboot step, routing around failure is usually a faster approach (usually taking at most 3–4 minutes). So, to most quickly mitigate impact after an attempted restart, route around the failure, and then, in the background, continue the recovery process to return capacity to your workload. 
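The escalation order with its route-around cutoff can be sketched as follows; the deadline and the action signatures are illustrative assumptions:

```python
import time

def recover(actions, deadline_seconds, clock=time.monotonic):
    """Walk the recovery ladder (restart -> reboot -> redeploy -> replace).
    Each action returns True if it restored the component. If the deadline
    passes before recovery succeeds, stop escalating and route around the
    failure, since shifting traffic is the faster mitigation at that point."""
    start = clock()
    for name, action in actions:
        if clock() - start > deadline_seconds:
            return "route-around"
        if action():
            return name
    return "route-around"
```

In a real workload, the actions would be runbook automation steps and the route-around branch would mark the resource unhealthy so traffic shifts to spares while recovery continues in the background.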

**Rule 10**  
 Focus on impact mitigation, not problem resolution. Take the fastest path back to normal operation. 

## Failure diagnosis
<a name="failure-diagnosis"></a>

 Part of the repair process after detection is the diagnosis period. This is the period of time where operators try to determine what is wrong. This process might involve querying logs, reviewing Operational Health metrics, or logging into hosts to troubleshoot. All of these actions require time, so creating tools and runbooks to expedite these actions can help reduce the MTTR as well. 

## Runbooks and automation
<a name="runbooks-and-automation"></a>

 Similarly, after you determine what is wrong and what course of action will repair the workload, operators typically need to perform some set of steps to do that. For example, after a failure, the fastest way to repair the workload might be to restart it, which can involve multiple, ordered steps. Utilizing a runbook that either automates these steps or provides specific direction to an operator will expedite the process and help reduce the risk of inadvertent action. 

# Increasing MTBF
<a name="increasing-mtbf"></a>

 The final component of improving availability is increasing the MTBF. This applies both to the software and to the AWS services used to run it. 

## Increasing distributed system MTBF
<a name="increasing-distributed-system-mtbf"></a>

 One way to increase MTBF is to reduce defects in the software. There are several ways to do this. Customers can use tools like [Amazon CodeGuru Reviewer](https://aws.amazon.com/codeguru/) to find and remediate common errors. You should also perform comprehensive peer code reviews, unit tests, integration tests, regression tests, and load tests on software before it is deployed to production. Increasing the amount of code coverage in tests will help ensure that even uncommon code execution paths are tested. 

 Deploying smaller changes can also help prevent unexpected outcomes by reducing the complexity of change. Each activity provides an opportunity to identify and fix defects before they can ever be invoked. 

 Another approach to preventing failure is [regular testing](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/test-reliability.html). Implementing a chaos engineering program can help test how your workload fails, validate recovery procedures, and help find and fix failure modes before they occur in production. Customers can use [AWS Fault Injection Simulator](https://aws.amazon.com/fis/) as part of their chaos engineering experiment toolset. 

 Fault tolerance is another way to prevent failure in a distributed system. Fail-fast modules, retries with exponential backoff and jitter, transactions, and idempotency are all techniques to help make workloads fault tolerant. 

 Transactions are a group of operations that adhere to the ACID properties. They are as follows: 
+  **Atomicity** – Either all of the actions happen or none of them will happen. 
+  **Consistency** – Each transaction leaves the workload in a valid state. 
+  **Isolation** – Transactions performed concurrently leave the workload in the same state as if they had been performed sequentially. 
+  **Durability** – Once a transaction commits, all of its effects are preserved even in the case of workload failure. 

 Retries with [exponential backoff and jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) allow you to overcome transient failures caused by Heisenbugs, overload, or other conditions. When transactions are idempotent, they can be retried multiple times without side effects. 
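A common sketch of this technique is "full jitter" backoff, where each retry sleeps a random duration up to an exponentially growing cap; the base and cap values here are illustrative:

```python
import random
import time

def backoff_with_jitter(attempt, base=0.1, cap=20.0):
    """'Full jitter' backoff: a random delay between 0 and
    min(cap, base * 2**attempt), so many retrying clients don't
    synchronize into waves of load against a recovering dependency."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(operation, is_retryable, max_attempts=5, sleep=time.sleep):
    """Retry an idempotent operation; non-retryable or final failures
    are re-raised to the caller."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            sleep(backoff_with_jitter(attempt))
```

Because the operation may run more than once, this pattern is only safe when the operation is idempotent, as the text notes.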

 If we consider the effect of a Heisenbug on a fault-tolerant hardware configuration, we'd be fairly unconcerned since the probability of the Heisenbug appearing on both the primary and redundant subsystem is infinitesimally small. (See Jim Gray, "[Why Do Computers Stop and What Can Be Done About It?](https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2018/Papers/gray85-easy.pdf)", June 1985, Tandem Technical Report 85.7.) In distributed systems, we want to achieve the same outcomes with our software. 

 When a Heisenbug is invoked, it's imperative that the software quickly detects the incorrect operation and fails so that it can be tried again. This is achieved through defensive programming and validating inputs, intermediate results, and outputs. Additionally, processes are isolated and share no state with other processes. 

 This modular approach ensures that the scope of impact during failure is limited. Processes fail independently. When a process does fail, the software should use “process-pairs” to retry the work, meaning a new process can assume the work of a failed one. To maintain the reliability and integrity of the workload, each operation should be treated as an ACID transaction. 

 This allows a process to fail without corrupting the state of the workload by aborting the transaction and rolling back any changes made. This allows the recovery process to retry the transaction from a known-good state and restart gracefully. This is how software can be fault-tolerant to Heisenbugs. 

 However, you should not aim to make software fault-tolerant to Bohrbugs. These defects must be found and removed before the workload enters production, since no level of redundancy will ever achieve the correct outcome. (See Jim Gray, "[Why Do Computers Stop and What Can Be Done About It?](https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf)", June 1985, Tandem Technical Report 85.7.) 

 The final way to increase MTBF is to reduce the scope of impact from failure. Using [fault isolation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/use-fault-isolation-to-protect-your-workload.html) through modularization to create fault containers is a primary way to do so, as outlined earlier in *Fault tolerance and fault isolation*. Reducing the failure rate improves availability. AWS uses techniques like dividing services into control planes and data planes, [Availability Zone Independence](https://aws.amazon.com/builders-library/static-stability-using-availability-zones) (AZI), [Regional isolation](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/), [cell-based architectures](https://www.youtube.com/watch?v=swQbA4zub20), and [shuffle-sharding](https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding) to provide fault isolation. These patterns can be used by AWS customers as well. 

 For example, consider a workload that places customers into fault containers within its infrastructure, each serving at most 5% of the total customers. One of these fault containers experiences an event that increases latency beyond the client timeout for 10% of requests. During this event, the service was 100% available for 95% of customers; for the other 5%, it appeared 90% available. This results in an availability of 1 − (5% of customers × 10% of their requests) = 99.5%, instead of 10% of requests failing for 100% of customers (which would result in 90% availability). 
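The availability arithmetic from this example can be checked with a short helper:

```python
def workload_availability(affected_customer_share, failed_request_share):
    """Availability when impact is contained to one fault container:
    only the affected customers' failed requests count against it."""
    return 1 - affected_customer_share * failed_request_share

# Failure contained to a cell serving 5% of customers, failing 10% of their requests:
isolated = workload_availability(0.05, 0.10)    # 0.995
# The same failure without fault isolation, hitting all customers:
unisolated = workload_availability(1.00, 0.10)  # 0.90
```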

**Rule 11**  
Fault isolation decreases scope of impact and increases the MTBF of the workload by reducing the overall failure rate. 

## Increasing Dependency MTBF
<a name="increasing-dependency-mtbf"></a>

 The first method to increase your AWS dependency MTBF is through using [fault isolation](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/use-fault-isolation-to-protect-your-workload.html). Many AWS services offer isolation at the AZ level, meaning a failure in one AZ does not affect the service in a different AZ. 

 Using redundant EC2 instances in multiple AZs increases subsystem availability. AZI provides a sparing capability inside a single Region, allowing you to increase your availability for AZI services. 

 However, not all AWS services operate at the AZ level. Many others offer regional isolation. In this case, where the designed-for availability of the regional service doesn't support the overall availability required for your workload, you might consider a multi-Region approach. Each Region offers an isolated instantiation of the service, equivalent to sparing. 

 There are various services that help make building a multi-Region service easier. For example: 
+  [Amazon Aurora Global Database](https://aws.amazon.com/rds/aurora/global-database/) 
+  [Amazon DynamoDB global tables](https://aws.amazon.com/dynamodb/global-tables/) 
+  [Amazon ElastiCache (Redis OSS) – Global Datastore](https://aws.amazon.com/elasticache/redis/global-datastore/) 
+  [AWS Global Accelerator](https://aws.amazon.com/global-accelerator/) 
+  [Amazon S3 Cross-Region Replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) 
+  [Amazon Route 53 Application Recovery Controller](https://aws.amazon.com/route53/application-recovery-controller/) 

 This document doesn't delve into the strategies for building multi-Region workloads, but you should weigh the availability benefits of multi-Region architectures against the additional cost, complexity, and operational practices they require to meet your desired availability goals. 

 The next method to increase dependency MTBF is to design your workload to be statically stable. For example, you have a workload that serves product information. When your customers make a request for a product, your service makes a request to an external metadata service to retrieve product details. Then your workload returns all of that information to the user. 

 However, if the metadata service is unavailable, the requests made by your customers fail. Instead, you can asynchronously pull or push the metadata locally to your service to be used to answer requests. This eliminates the synchronous call to the metadata service from your critical path. 

 Additionally, because your service is still available even when the metadata service is not, you can remove it as a dependency in your availability calculation. This example is dependent on the assumption that the metadata doesn’t change frequently and that serving stale metadata is better than the request failing. Another similar example is [serve-stale](https://www.rfc-editor.org/rfc/rfc8767) for DNS that allows data to be kept in the cache beyond the TTL expiry and used for responses when a refreshed answer is not readily available. 
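A minimal sketch of this serve-stale pattern, with hypothetical names: it caches the last successful fetch and keeps serving it when a refresh fails, rather than failing the request:

```python
import time

class StaleTolerantCache:
    """Serve locally cached metadata; if a refresh past the TTL fails,
    keep answering from the stale copy. This assumes stale metadata is
    better than an error, per the text."""
    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl_seconds, clock
        self._value, self._fetched_at = None, None

    def get(self):
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at > self._ttl)
        if expired:
            try:
                self._value, self._fetched_at = self._fetch(), now
            except Exception:
                if self._fetched_at is None:
                    raise  # no known-good state to fall back to
        return self._value
```

Because the refresh happens out of the request's critical path to correctness, the metadata service drops out of the availability calculation, as described above.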

 The final method to increase dependency MTBF is to reduce the scope of impact from failure. As discussed earlier, failure is not a binary event; there are degrees of failure. This is the effect of modularization: failure is contained to just the requests or users being serviced by that container. 

 This results in fewer failures during an event which ultimately increases availability of the overall workload by limiting the scope of impact. 

## Reducing common sources of impact
<a name="reducing-common-sources-of-impact"></a>

 In 1985, Jim Gray discovered, during a study at Tandem Computers, that failure was primarily driven by two things: software and operations. (See Jim Gray, "[Why Do Computers Stop and What Can Be Done About It?](https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf)", June 1985, Tandem Technical Report 85.7.) Even 36 years later, this continues to be true. Despite advances in technology, there isn't an easy solution to these problems, and the major sources of failure haven't changed. Addressing failures in software was discussed at the beginning of this section, so the focus here is on operations and reducing the frequency of failure. 

### Stability compared with features
<a name="stability-compared-with-features"></a>

 If we refer back to the failure rates for software and hardware graph in the section [Distributed system availability](distributed-system-availability.md), we see that defects are introduced with each software release. This means that any change to the workload increases the risk of failure. These changes are typically new features, which leads to a corollary: higher-availability workloads favor stability over new features. Thus, one of the simplest ways to improve availability is to deploy less often or deliver fewer features. Workloads that deploy more frequently will inherently have lower availability than those that don't. However, workloads that fail to add features don't keep up with customer demand and can become less useful over time. 

 So, how do we continue to innovate and release features safely? The answer is standardization. What is the correct way to deploy? How do you order deployments? What are the standards for testing? How long do you wait between stages? Do your unit tests cover enough of the software code? These are questions that standardization will answer and prevent issues caused by things like not load testing, skipping deployment stages, or deploying too quickly to too many hosts. 

 The way that you implement standardization is through automation. It reduces the chance of human mistakes and lets computers do the thing they're good at, which is doing the same thing over and over the same way every time. The way you stick standardization and automation together is to set goals. Goals like no manual changes, host access only through contingent authorization systems, writing load tests for every API, and so on. Operational excellence is a cultural norm that can require substantial change. Establishing and tracking performance against a goal helps drive cultural change that will have a broad impact on workload availability. The [AWS Well-Architected Operational Excellence pillar](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html) provides comprehensive best practices for operational excellence. 

### Operator safety
<a name="operator-safety"></a>

 The other major contributor to operational events that introduce failure is people. Humans make mistakes. They might use the wrong credentials, enter the wrong command, press Enter too soon, or miss a critical step. Manual action consistently leads to errors, and errors lead to failure. 

 One of the major causes for operator errors are confusing, unintuitive, or inconsistent user interfaces. Jim Gray also noted in his 1985 study that “interfaces that ask the operator for information or ask him to perform some function must be simple, consistent, and operator fault-tolerant.” (See Jim Gray, "[Why Do Computers Stop and What Can Be Done About It?](https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf)", June 1985, Tandem Technical Report 85.7.) This insight continues to be true today. There are numerous examples over the past three decades throughout the industry where a confusing or complex user interface, lack of confirmation or instructions, or even just unfriendly human language caused an operator to do the wrong thing. 

**Rule 12**  
Make it easy for operators to do the right thing. 

### Preventing overload
<a name="preventing-overload"></a>

 The final common contributor of impact is your customers, the actual users of your workload. Successful workloads tend to get used, a lot, but sometimes that usage outpaces the workload's ability to scale. Many things can happen: disks can fill up, thread pools can be exhausted, network bandwidth can become saturated, or database connection limits can be reached. 

 There is no failproof method to eliminate these, but proactive monitoring of capacity and utilization through Operational Health metrics will provide early warnings when these failures might occur. Techniques like [load-shedding](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload), [circuit breakers](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-interactions-in-a-distributed-system-to-mitigate-or-withstand-failures.html), and [retry with exponential backoff and jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) can help minimize the impact and increase the success rate, but these situations still represent failure. Automated scaling based on Operational Health metrics can help reduce the frequency of failure due to overload, but might not be able to respond quickly enough to changes in utilization. 

 If you need to ensure continuously available capacity for customers, you have to make tradeoffs between availability and cost. One way to ensure that lack of capacity doesn't lead to unavailability is to provide each customer with a quota and ensure your workload's capacity is scaled to provide 100% of the allocated quotas. When customers exceed their quota, they get throttled, which isn't a failure and doesn't count against availability. You will also need to closely track your customer base and forecast future utilization to keep enough capacity provisioned. This ensures your workload isn't driven to failure scenarios through overconsumption by your customers. For more information, see:
+  [Amazon Builders' Library – Using load shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/) 
+  [Amazon Builders' Library – Fairness in multi-tenant systems](https://aws.amazon.com/builders-library/fairness-in-multi-tenant-systems) 

For example, let's examine a workload that provides a storage service. Each server in the workload can support 100 downloads per second, customers are provided a quota of 200 downloads per second, and there are 500 customers. To support this volume of customers, the service needs to provide capacity for 100,000 downloads per second, which requires 1,000 servers. If any customer exceeds their quota, they get throttled, ensuring sufficient capacity for every other customer. This is a simple example of one way to avoid overload without rejecting units of work.
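The quota-and-throttle scheme in this example can be sketched as follows; the fixed one-second window is a simplification (production systems often use token buckets or fairness-aware schemes instead):

```python
class QuotaThrottler:
    """Per-customer throttle over a fixed one-second window: each customer
    may start at most `quota` downloads per window; requests beyond that
    are throttled (a retryable rejection, not an availability failure)."""
    def __init__(self, quota_per_second):
        self.quota = quota_per_second
        self._windows = {}  # customer -> (window_start_second, count)

    def allow(self, customer, now_seconds):
        window = int(now_seconds)
        start, count = self._windows.get(customer, (window, 0))
        if start != window:           # new window: reset the counter
            start, count = window, 0
        if count >= self.quota:
            return False              # throttled
        self._windows[customer] = (start, count + 1)
        return True

# Capacity planning from the example: 500 customers with a 200 downloads/s
# quota, and 100 downloads/s per server, requires 1,000 servers.
servers_needed = (500 * 200) // 100
```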