

 This whitepaper is for historical reference only. Some content might be outdated and some links might not be available.

# Understanding availability
<a name="understanding-availability"></a>

 Availability is one of the primary ways we can quantitatively measure resiliency. We define availability, *A*, as the percentage of time that a workload is available for use. It’s a ratio of its expected “uptime” (being available) to the total time being measured (the expected “uptime” plus the expected “downtime”). 

![\[Picture of equation. A = uptime / (uptime + downtime)\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/availability.png)


 To better understand this formula, we’ll look at how to measure uptime and downtime. First, we want to know how long the workload will go without failure. We call this *mean time between failure* (MTBF), the average time between when a workload begins normal operation and its next failure. Then, we want to know how long it will take to recover after it has failed. 

 We call this *mean time to repair (or recovery)* (MTTR), a period of time when the workload is unavailable while the failed subsystem is repaired or returned to service. An important period of time in the MTTR is the *mean time to detection* (MTTD), the amount of time between a failure occurring and when repair operations begin. The following diagram demonstrates how all of these metrics are related. 

![\[Diagram showing the relationship between MTTD, MTTR, and MTBF\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/availability-metrics.png)


 We can thus express availability, *A*, using MTBF, the time the workload is up, and MTTR, the time the workload is down. 

![\[Picture of equation. A = MTBF / ( MTBF + MTTR)\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/equation2.png)


 And the probability the workload is “down” (that is, not available) is the probability of failure, *F*. 

![\[Picture of equation. F = 1 - A\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/equation3.png)
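These two equations translate directly into code. The following is a minimal sketch with hypothetical numbers, purely for illustration:

```python
# Minimal sketch of the availability equations above, with hypothetical numbers.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: a workload that fails on average every 1,000 hours of operation
# and takes 1 hour to detect and repair.
a = availability(1000, 1)
f = 1 - a  # F = 1 - A, the probability the workload is down
print(f"A = {a:.4%}, F = {f:.4%}")  # A ≈ 99.9001%, F ≈ 0.0999%
```

Improving either term helps: doubling MTBF or halving MTTR both shrink the downtime fraction.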


[Reliability](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/reliability.html) is the ability of a workload to do the right thing, when requested, within the specified response time. This is what availability measures. Having a workload fail less frequently (longer MTBF) or having a shorter repair time (shorter MTTR) improves its availability. 

**Rule 1**  
Less frequent failure (longer MTBF), shorter failure detection times (shorter MTTD), and shorter repair times (shorter MTTR) are the three factors that are used to improve availability in distributed systems. 

**Topics**
+ [Distributed system availability](distributed-system-availability.md)
+ [Availability with dependencies](availability-with-dependencies.md)
+ [Availability with redundancy](availability-with-redundancy.md)
+ [CAP theorem](cap-theorem.md)
+ [Fault tolerance and fault isolation](fault-tolerance-and-fault-isolation.md)

# Distributed system availability
<a name="distributed-system-availability"></a>

 Distributed systems are made up of both software components and hardware components. Some of the software components might themselves be another distributed system. The availability of both the underlying hardware and software components affects the resulting availability of your workload. 

 The calculation of availability using MTBF and MTTR has its roots in hardware systems. However, distributed systems fail for very different reasons than a piece of hardware does. Where a manufacturer can consistently calculate the average time before a hardware component wears out, the same testing can't be applied to the software components of a distributed system. Hardware typically follows the “bathtub” curve of failure rate, while software follows a staggered curve produced by additional defects that are introduced with each new release (see [Software Reliability](https://users.ece.cmu.edu/~koopman/des_s99/sw_reliability).) 

![\[Diagram showing hardware and software failure rates\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/failure-rates.png)


 Additionally, the software in distributed systems typically changes at rates exponentially higher than hardware. For example, a standard magnetic hard drive might have an average annualized failure rate (AFR) of 0.93%, which, in practice for an HDD, can mean a lifespan of at least 3–5 years before it reaches the wear-out period, potentially longer (see [Backblaze Hard Drive Data and Stats, 2020](https://www.backblaze.com/b2/hard-drive-test-data.html)). The hard drive doesn’t materially change during that lifetime, whereas, in those same 3–5 years, Amazon might deploy 450 to 750 million changes to its software systems. (See [Amazon Builders' Library – Automating safe, hands-off deployments](https://aws.amazon.com/about-aws/whats-new/2020/06/new-abl-article-automating-safe-hands-off-deployments/).) 

 Hardware is also subject to the concept of planned obsolescence; that is, it has a built-in lifespan and will need to be replaced after a certain period of time. (See [The Great Lightbulb Conspiracy](https://spectrum.ieee.org/tech-history/dawn-of-electronics/the-great-lightbulb-conspiracy).) Software, theoretically, is not subject to this constraint; it doesn’t have a wear-out period and can be operated indefinitely. 

 All of this means that the same testing and prediction models used for hardware to generate MTBF and MTTR numbers don’t apply to software. There have been hundreds of attempts to build models to solve this problem since the 1970s, but they all generally fall into two categories: prediction modeling and estimation modeling. (See [List of software reliability models](https://en.wikipedia.org/wiki/List_of_software_reliability_models).) 

 Thus, a forward-looking MTBF and MTTR for distributed systems, and therefore a forward-looking availability, will always be derived from some type of prediction or forecast. These figures may be generated through predictive modeling, stochastic simulation, historical analysis, or rigorous testing, but they are not a guarantee of uptime or downtime. 

 The reasons that a distributed system failed in the past may never reoccur. The reasons it fails in the future are likely to be different and possibly unknowable. The recovery mechanisms required might also be different for future failures than ones used in the past and take significantly different amounts of time. 

 Additionally, MTBF and MTTR are averages. There will be some variance from the average value to the actual values seen (the standard deviation, σ, measures this variation). Thus, workloads may experience shorter or longer time between failures and recovery times in actual production use. 

 That being said, the availability of the software components that make up a distributed system is still important. Software can fail for numerous reasons (discussed more in the next section), and those failures impact the workload’s availability. Thus, for highly available distributed systems, calculating, measuring, and improving the availability of software components should receive the same focus as hardware and external software subsystems. 

**Rule 2**  
 The availability of the software in your workload is an important factor of your workload’s overall availability and should receive an equal focus as other components. 

 It’s important to note that despite MTBF and MTTR being difficult to predict for distributed systems, they still provide key insights into how to improve availability. Reducing the frequency of failure (higher MTBF) and decreasing the time to recover after failure occurs (lower MTTR) will both lead to a higher empirical availability. 

## Types of failures in distributed systems
<a name="types-of-failures-in-distributed-systems"></a>

 There are generally two classes of bugs in distributed systems that affect availability, affectionately named the *Bohrbug* and *Heisenbug* (see ["A Conversation with Bruce Lindsay", ACM Queue vol. 2, no. 8 – November 2004](http://queue.acm.org/detail.cfm?id=1036486).) 

 A Bohrbug is a repeatable functional software issue. Given the same input, the bug will consistently produce the same incorrect output (like the deterministic Bohr atom model, which is solid and easily detected). These types of bugs are rare by the time a workload gets to production. 

 A Heisenbug is a bug that is transient, meaning that it only occurs in specific and uncommon conditions. These conditions are usually related to things like hardware (for example, a transient device fault or hardware implementation specifics like register size), compiler optimizations and language implementation, limit conditions (for example, temporarily out of storage), or race conditions (for example, not using a semaphore for multi-threaded operations). 

 Heisenbugs make up the majority of bugs in production and are difficult to find because they are elusive and seem to change behavior or disappear when you try to observe or debug them. However, if you restart the program, the failed operation will likely succeed because the operating environment is slightly different, eliminating the conditions that introduced the Heisenbug. 

 Thus, most failures in production are transient, and when the operation is retried, it is unlikely to fail again. To be resilient, distributed systems have to be fault tolerant to Heisenbugs. We’ll explore how this can be achieved in the section [Increasing distributed system MTBF](increasing-mtbf.md#increasing-mtbf.title).

# Availability with dependencies
<a name="availability-with-dependencies"></a>

 In the previous section, we mentioned that hardware, software, and potentially other distributed systems are all components of your workload. We call these components *dependencies*, the things your workload depends on to provide its functionality. There are *hard* dependencies, which are those things your workload cannot function without, and *soft* dependencies, whose unavailability can go unnoticed or be tolerated for some period of time. Hard dependencies have a direct impact on your workload’s availability. 

 We might want to try to calculate the theoretical maximum availability of a workload. This is the product of the availabilities of all of the dependencies, including the software itself (where *α*<sub>*n*</sub> is the availability of a single subsystem), because each one must be operational. 

![\[Picture of equation. A = α1 × α2 × ... × αn\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/equation4.png)


 The availability numbers used in these calculations are usually associated with things like SLAs or Service-Level Objectives (SLOs). SLAs define the expected level of service customers will receive, the metrics by which the service is measured, and remediations or penalties (usually monetary) should the service levels not be achieved. 

 Using the above formula, we can conclude that, purely mathematically, a workload can be no more available than its least available dependency. But in reality, this is typically not what we see. A workload built using two or three dependencies with 99.99% availability SLAs can still achieve 99.99% availability itself, or higher. 

 This is because as we outlined in the previous section, these availability numbers are estimates. They estimate or predict how often a failure occurs and how quickly it can be repaired. They are not a guarantee of downtime. Dependencies frequently exceed their stated availability SLA or SLO. 

 Dependencies may also target higher internal availability objectives than the numbers provided in public SLAs. This provides a level of risk mitigation in meeting SLAs when the unknown or unknowable happens. 

 Finally, your workload might have dependencies whose SLAs can’t be known or don’t offer an SLA or SLO. For example, world-wide internet routing is a common dependency for many workloads, but it’s hard to know which internet service provider(s) your global traffic is using, whether they have SLAs, and how consistent they are across providers. 

 What this all tells us is that computing a maximum theoretical availability will only produce a rough order-of-magnitude calculation; by itself, it is unlikely to be accurate or provide meaningful insight. What the math does tell us is that the fewer things your workload relies on, the lower the overall likelihood of failure. The fewer numbers less than one you multiply together, the larger the result. 
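The serial-dependency math can be sketched in a few lines (the availability numbers here are hypothetical, for illustration only):

```python
# Theoretical maximum availability of a workload with hard dependencies:
# the product of each dependency's availability.
import math

def max_availability(dependency_availabilities: list[float]) -> float:
    return math.prod(dependency_availabilities)

# Three hypothetical dependencies, each with a 99.99% availability SLA:
print(f"{max_availability([0.9999] * 3):.6f}")  # the ceiling drops below any single SLA
```

Each additional factor below 1.0 lowers the product, which is the mathematical form of Rule 3.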

**Rule 3**  
 Reducing dependencies can have a positive impact on availability. 

 The math also helps inform the dependency selection process. The selection process affects how you design your workload, how you take advantage of redundancy in dependencies to improve their availability, and whether you take those dependencies as soft or hard. Dependencies that can impact your workload should be carefully chosen. The next rule provides guidance on how to do so. 

**Rule 4**  
 In general, select dependencies whose availability goals are equal to or greater than the goals of your workload. 

# Availability with redundancy
<a name="availability-with-redundancy"></a>

 When a workload utilizes multiple, independent, and redundant subsystems, it can achieve a higher level of theoretical availability than by using a single subsystem. For example, consider a workload composed of two identical subsystems. It can be completely operational if either subsystem one or subsystem two is operational. For the whole system to be down, both subsystems must be down at the same time. 

 If one subsystem's probability of failure is 1 − *α*, then the probability of both redundant subsystems being down at the same time is the product of each subsystem's probability of failure, *F* = (1 − *α*<sub>1</sub>) × (1 − *α*<sub>2</sub>). For a workload with two redundant subsystems, using Equation *(3)*, this gives an availability defined as: 

![\[Picture of three equations\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/equation5.png)


 So, for two subsystems whose availability is 99%, the probability that one fails is 1% and the probability that they both fail is (1−99%) × (1−99%) = .01%. This makes the availability using two redundant subsystems 99.99%. 
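This two-subsystem case can be checked with a short sketch, assuming the subsystems fail independently:

```python
# Availability of two independent, redundant subsystems:
# the workload is down only when both are down at the same time.
def redundant_availability(a1: float, a2: float) -> float:
    return 1 - (1 - a1) * (1 - a2)

print(f"{redundant_availability(0.99, 0.99):.4f}")  # two 99% subsystems -> 0.9999
```

Note the contrast with serial dependencies: redundancy multiplies failure probabilities (making them smaller), while hard dependencies multiply availabilities (making them smaller).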

 This can be generalized to incorporate additional redundant spares, *s*, as well. In Equation *(5)* we only assumed a single spare, but a workload might have two, three, or more spares so that it can survive the simultaneous loss of multiple subsystems without impacting availability. If a workload has three subsystems and two are spares, the probability that all three subsystems fail at the same time is (1 − *α*) × (1 − *α*) × (1 − *α*), or (1 − *α*)<sup>3</sup>. In general, a workload with *s* spares will only fail if *s* + 1 subsystems fail. 

 For a workload with *n* subsystems and *s* spares, *f* is the number of failure modes, or the ways that *s* + 1 subsystems can fail out of *n*. 

 This is effectively the binomial coefficient, the combinatorial math of choosing *k* elements from a set of *n*, or “*n* choose *k*”. In this case, *k* is *s* + 1. 

![\[Picture of four equations\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/equation6.png)


 We can then produce a generalized availability approximation that incorporates the number of failure modes and sparing. (To understand why this is an approximation, refer to Appendix 2 of Highleyman, et al., [Breaking the Availability Barrier](https://www.amazon.com/Breaking-Availability-Barrier-Survivable-Enterprise/dp/1410792331).) 

![\[Picture of four equations\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/equation7.png)


 Sparing can be applied to any dependency that provides resources that fail independently. Amazon EC2 instances in different AZs or Amazon S3 buckets in different AWS Regions are examples of this. Using spares helps that dependency achieve a higher total availability to support the workload’s availability goals. 

**Rule 5**  
 Use sparing to increase the availability of dependencies in a workload. 

 However, sparing comes at a cost. Each additional spare costs the same as the original module, driving cost up at least linearly. Building a workload that can use spares also increases its complexity. It must know how to identify dependency failure, shift work away from the failed dependency to a healthy resource, and manage the overall capacity of the workload. 

 Redundancy is an optimization problem. Too few spares, and the workload can fail more frequently than desired; too many, and the workload costs too much to run. There is a threshold at which adding more spares costs more than the additional availability they achieve warrants. 

 Using our general availability-with-spares formula, Equation *(7)*, for a subsystem that has a 99.5% availability: with two spares, the workload’s availability is *A* ≈ 1 − (1)(1 − .995)<sup>3</sup> = 99.9999875% (approximately 3.94 seconds of downtime a year), and with 10 spares we get *A* ≈ 1 − (1)(1 − .995)<sup>11</sup>, about 25.5 nines of availability (the approximate downtime would be 1.26252 × 10<sup>−15</sup> ms per year, effectively zero). Comparing these two workloads, we’ve incurred a 5x increase in the cost of sparing to achieve four seconds less downtime a year. For most workloads, the increase in cost would be unwarranted for this increase in availability. The following figure shows this relationship. 

![\[Diagram showing diminishing returns from increased sparing\]](http://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/images/effect-of-sparing.png)


 At three spares and beyond, the result is fractions of a second of expected downtime a year, meaning that after this point you reach the area of diminishing returns. There might be an urge to “just add more” to achieve higher levels of availability, but in reality, the cost benefit disappears very quickly. Using more than three spares does not provide material, noticeable gain for almost all workloads when the subsystem itself has at least a 99% availability. 
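The diminishing returns described above can be reproduced with a short sketch (assuming identical, independent subsystems, and treating the binomial failure-mode factor as 1 since all spares must fail together):

```python
# Availability with s spares, simplified: A ≈ 1 - (1 - α)^(s + 1)
SECONDS_PER_YEAR = 365 * 24 * 3600

def availability_with_spares(alpha: float, spares: int) -> float:
    return 1 - (1 - alpha) ** (spares + 1)

for s in (0, 1, 2, 3):
    a = availability_with_spares(0.995, s)
    downtime_s = (1 - a) * SECONDS_PER_YEAR
    print(f"spares={s}  downtime/year ≈ {downtime_s:.3g} s")
# Beyond two or three spares, yearly downtime is already fractions of a second.
```

Because the failure probability shrinks geometrically while cost grows linearly, the cost-benefit curve flattens quickly.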

**Rule 6**  
 There is an upper bound to the cost efficiency of sparing. Utilize the fewest spares necessary to achieve the required availability. 

 You should consider the unit of failure when selecting the correct number of spares. For example, let's examine a workload that requires 10 EC2 instances to handle peak capacity and they are deployed in a single AZ. 

 Because AZs are designed to be fault isolation boundaries, the unit of failure is not only a single EC2 instance, because an entire AZ worth of EC2 instances can fail together. In this case, you will want to [add redundancy with another AZ](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/use-fault-isolation-to-protect-your-workload.html), deploying 10 additional EC2 instances to handle the load in case of an AZ failure, for a total of 20 EC2 instances (following the pattern of static stability). 

 While this appears to be 10 spare EC2 instances, it is really just a single spare AZ, so we haven't exceeded the point of diminishing returns. However, you can be even more cost efficient while also increasing your availability by utilizing three AZs and deploying five EC2 instances per AZ. 

 This provides one spare AZ with a total of 15 EC2 instances (versus two AZs with 20 instances), still providing the required 10 total instances to serve peak capacity during an event impacting a single AZ. Thus, you should build in sparing to be fault tolerant across all fault isolation boundaries used by the workload (instance, cell, AZ, and Region). 
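The AZ sizing arithmetic above can be sketched as follows, assuming capacity is measured in instances and at most one AZ is impaired at a time:

```python
import math

def instances_per_az(peak_instances: int, total_azs: int) -> int:
    """Size each AZ so that losing one AZ still leaves peak capacity."""
    return math.ceil(peak_instances / (total_azs - 1))

for azs in (2, 3):
    per_az = instances_per_az(10, azs)
    print(f"{azs} AZs: {per_az} instances per AZ, {per_az * azs} total")
# 2 AZs: 10 per AZ, 20 total; 3 AZs: 5 per AZ, 15 total.
```

Spreading the same required capacity over more AZs shrinks the per-AZ unit of failure, which is why three AZs are cheaper than two here despite the same one-AZ spare.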

# CAP theorem
<a name="cap-theorem"></a>

 Another way that we might think about availability is in relation to the CAP theorem. The theorem states that a distributed system, one made up of multiple nodes storing data, cannot simultaneously provide more than two out of the following three guarantees: 
+  **C**onsistency: Every read request receives the most recent write or an error when consistency can’t be guaranteed. 
+  **A**vailability: Every request receives a non-error response, even when nodes are down or unavailable. 
+  **P**artition tolerance: The system continues to operate despite the loss of an arbitrary number of messages between nodes. 

(For more details, see Seth Gilbert and Nancy Lynch, [“Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services”](http://dl.acm.org/citation.cfm?id=564601&CFID=609557487&CFTOKEN=15997970), *ACM SIGACT News*, Volume 33, Issue 2 (2002), pp. 51–59.) 

 Most distributed systems have to tolerate network failures, and thus, network partitioning has to be allowed. This means that these workloads have to make a choice between consistency and availability when a network partition occurs. If the workload chooses availability, then it always returns a response, but with potentially inconsistent data. If it chooses consistency, then during a network partition it would return an error since the workload can’t be sure about the consistency of the data. 

 Workloads whose goal is to provide higher levels of availability might choose Availability and Partition tolerance (AP) to prevent returning errors (being unavailable) during a network partition. This requires a more relaxed [consistency model](https://en.wikipedia.org/wiki/Consistency_model), like eventual consistency or monotonic consistency. 

# Fault tolerance and fault isolation
<a name="fault-tolerance-and-fault-isolation"></a>

 These are two important concepts when we think about availability. Fault tolerance is the ability to [withstand subsystem failure](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-your-workload-to-withstand-component-failures.html) and maintain availability (doing the right thing within an established SLA). To implement fault tolerance, workloads use spare (or redundant) subsystems. When one of the subsystems in a redundant set fails, another picks up its work, typically almost seamlessly. In this case, spares are truly spare capacity; they are available to assume 100% of the work from the failed subsystem. With true spares, multiple subsystem failures are required to produce an adverse impact on the workload. 

 Fault isolation minimizes the scope of impact when a failure does occur. This is typically implemented with modularization. Workloads are broken down into small subsystems that fail independently and can be repaired in isolation. The failure of a module [does not propagate beyond the module](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-interactions-in-a-distributed-system-to-mitigate-or-withstand-failures.html). This idea spans both vertically, across different functionality in a workload, and horizontally, across multiple subsystems that provide the same functionality. These modules act as fault containers that limit the scope of impact during an event. 

 The architectural patterns of control planes, data planes, and static stability directly support implementing fault tolerance and fault isolation. The Amazon Builders’ Library article [Static stability using Availability Zones](https://aws.amazon.com/builders-library/static-stability-using-availability-zones) provides good definitions for these terms and how they apply to building resilient, highly available workloads. This whitepaper uses these patterns in the section [Designing highly available distributed systems on AWS,](designing-highly-available-distributed-systems-on-aws.md#designing-highly-available-distributed-systems-on-aws.title) and we also summarize their definitions here. 
+  **Control plane** – The part of the workload involved in making changes: adding resources, deleting resources, modifying resources, and propagating those changes to where they are needed. Control planes are typically more complex and have more moving parts than data planes, and are thus statistically more likely to fail and have lower availabilities. 
+  **Data plane** – The part of the workload that provides the day-to-day business functionality. Data planes tend to be simpler and operate at higher volumes than control planes, leading to higher availabilities. 
+  **Static stability** – The ability of a workload to continue correct operation despite dependency impairments. One method of implementation is to remove control plane dependencies from data planes. Another method is to loosely couple workload dependencies. Perhaps the workload doesn’t see any updated information (such as new things, deleted things, or modified things) that its dependency was supposed to have delivered. However, everything it was doing before the dependency became impaired continues to work. 

 When we think about impairment of a workload, there are two high-level approaches we can consider for recovery. The first method is to respond to that impairment after it happens, perhaps using AWS Auto Scaling to add new capacity. The second method is to prepare for those impairments before they happen, maybe by overprovisioning a workload’s infrastructure so that it can continue to operate correctly without needing additional resources. 

 A statically stable system uses the latter approach. It pre-provisions spare capacity to be available during failure. This method avoids creating a dependency on a control plane in the workload’s recovery path to provision new capacity to recover from the failure. Additionally, provisioning new capacity for various resources takes time. While waiting for new capacity, your workload can be overloaded by existing demand and experience further degradation, leading to “brown-out” or complete availability loss. However, you should also consider the cost implications of utilizing pre-provisioned capacity against your availability goals. 

 Static stability provides the next two rules for high availability workloads. 

**Rule 7**  
 Don’t take dependencies on control planes in your data plane, especially during recovery. 

**Rule 8**  
 Loosely couple dependencies so your workload can operate correctly despite dependency impairment, where possible. 