

# Reliability

 The reliability pillar encompasses the ability of a workload to perform its intended function correctly and consistently when it’s expected to. This includes the ability to operate and test the workload through its total lifecycle. This paper provides in-depth, best practice guidance for implementing reliable workloads on AWS. 

**Topics**
+ [Shared Responsibility Model for Resiliency](shared-responsibility-model-for-resiliency.md)
+ [Design principles](design-principles.md)
+ [Definitions](definitions.md)
+ [Understanding availability needs](understanding-availability-needs.md)

# Shared Responsibility Model for Resiliency

 Resiliency is a shared responsibility between AWS and you. It is important that you understand how disaster recovery (DR) and availability, as part of resiliency, operate under this shared model. 

 **AWS responsibility - Resiliency of the cloud** 

 AWS is responsible for resiliency of the infrastructure that runs all of the services offered in the AWS Cloud. This infrastructure comprises the hardware, software, networking, and facilities that run AWS Cloud services. AWS uses commercially reasonable efforts to make these AWS Cloud services available, ensuring service availability meets or exceeds [AWS Service Level Agreements (SLAs)](https://aws.amazon.com/legal/service-level-agreements/). 

 The [AWS Global Cloud Infrastructure](https://aws.amazon.com/about-aws/global-infrastructure/) is designed to allow customers to build highly resilient workload architectures. Each AWS Region is fully isolated and consists of multiple [Availability Zones](https://aws.amazon.com/about-aws/global-infrastructure/regions_az/#Availability_Zones), which are physically isolated partitions of infrastructure. Availability Zones isolate faults that could impact workload resilience, preventing them from impacting other zones in the Region. At the same time, all zones in a Region are interconnected over fully redundant, dedicated metro fiber that provides high-throughput, low-latency networking between zones. All traffic between zones is encrypted, and network performance is sufficient to accomplish synchronous replication between zones. When an application is partitioned across Availability Zones, workloads are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, hurricanes, and more. 

 **Customer responsibility - Resiliency in the cloud** 

 Your responsibility is determined by the AWS Cloud services that you select. This determines the amount of configuration work you must perform as part of your resiliency responsibilities. For example, a service such as Amazon Elastic Compute Cloud (Amazon EC2) requires the customer to perform all of the necessary resiliency configuration and management tasks. Customers that deploy Amazon EC2 instances are responsible for [deploying Amazon EC2 instances across multiple locations](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/use-fault-isolation-to-protect-your-workload.html) (such as AWS Availability Zones), [implementing self-healing](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-your-workload-to-withstand-component-failures.html) using services like Auto Scaling, and using [resilient workload architecture best practices](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/workload-architecture.html) for applications installed on the instances. For managed services, such as Amazon S3 and Amazon DynamoDB, AWS operates the infrastructure layer, the operating system, and platforms, and customers access the endpoints to store and retrieve data. You are responsible for managing the resiliency of your data, including backup, versioning, and replication strategies. 

 Deploying your workload across multiple Availability Zones in an AWS Region is part of a high availability strategy: issues are isolated to one Availability Zone, and the redundancy of the other Availability Zones is used to continue serving requests. A Multi-AZ architecture is also part of a DR strategy designed to keep workloads isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. DR strategies may also make use of multiple AWS Regions. For example, in an active/passive configuration, service for the workload fails over from its active Region to its DR Region if the active Region can no longer serve requests. 

![\[Chart illustrating the shared resiliency model.\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/shared-model-resiliency.png)


 

 You can use AWS services to achieve your resilience objectives. As a customer, you are responsible for managing the following aspects of your system to achieve resilience in the cloud. For more detail on each service, see the [AWS documentation](https://docs.aws.amazon.com/index.html). 

 **Networking, quotas, and constraints** 
+  Best practices for this area of the shared responsibility model are described in detail under [Foundations](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/foundations.html). 
+  Plan your architecture with adequate room to scale, and understand the [service quotas](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/manage-service-quotas-and-constraints.html) and constraints of the services you include, based on expected increases in load where applicable. 
+  Design your [network topology](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/plan-your-network-topology.html) to be highly available, redundant, and scalable. 

 **Change management and operational resilience** 
+  [Change management](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/change-management.html) includes how to introduce and manage change in your environment. [Implementing change](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/implement-change.html) requires building and maintaining up-to-date runbooks and deployment strategies for your application and infrastructure. 
+  A resilient strategy for [monitoring workload resources](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/monitor-workload-resources.html) considers all components, including both technical and business metrics, notifications, automation, and analysis. 
+  Workloads in the cloud must [adapt to changes in demand](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-your-workload-to-adapt-to-changes-in-demand.html), scaling in reaction to impairments or fluctuations in usage. 

 **Observability and failure management** 
+  Observing failures through monitoring is required to automate healing so that your workloads can [withstand component failures](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/design-your-workload-to-withstand-component-failures.html). 
+  [Failure management](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/failure-management.html) requires [backing up data](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/back-up-data.html), applying best practices to allow your workload to withstand component failures, and [planning for disaster recovery](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/plan-for-disaster-recovery-dr.html). 

 **Workload architecture** 
+  Your [workload architecture](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/workload-architecture.html) includes how you design services around business domains, apply SOA and distributed system design to prevent failures, and build in capabilities like throttling, retries, queue management, timeouts, and emergency levers. 
+  Rely on proven [AWS solutions](https://aws.amazon.com/solutions/), the [Amazon Builders' Library](https://aws.amazon.com/builders-library/), and [serverless patterns](https://serverlessland.com/patterns) to align with best practices and jump-start implementations. 
+  Use continuous improvement to decompose your system into distributed services so you can scale and innovate faster. Use [AWS microservices](https://aws.amazon.com/microservices/) guidance and managed service options to simplify and accelerate your ability to introduce change and innovate. 

 **Continuous testing of critical infrastructure** 
+  [Testing reliability](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/test-reliability.html) means testing at the functional, performance, and chaos levels, as well as adopting incident analysis and game day practices to build expertise in resolving issues that are not well understood. 
+  For both all-in cloud and hybrid applications, knowing how your application behaves when issues arise or components go down allows you to quickly and reliably recover from outages. 
+  Create and document repeatable experiments to understand how your system behaves when things don’t work as expected. These tests prove the effectiveness of your overall resilience and provide a feedback loop for your operational procedures before you face real failure scenarios. 

# Design principles

 In the cloud, there are a number of principles that can help you increase reliability. Keep these in mind as we discuss best practices: 
+  **Automatically recover from failure:** By monitoring a workload for key performance indicators (KPIs), you can run automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation of the service. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur. 
+  **Test recovery procedures:** In an on-premises environment, testing is often conducted to prove that the workload works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This approach exposes failure pathways that you can test and fix *before* a real failure scenario occurs, thus reducing risk. 
+  **Scale horizontally to increase aggregate workload availability:** Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure. 
+  **Stop guessing capacity:** A common cause of failure in on-premises workloads is resource saturation, when the demands placed on a workload exceed the capacity of that workload (this is often the objective of denial of service attacks). In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed (see [Manage Service Quotas and Constraints](manage-service-quotas-and-constraints.md)). 
+  **Manage change through automation:** Changes to your infrastructure should be made using automation. This includes changes to the automation itself, which can then be tracked and reviewed. 

# Definitions

 This whitepaper covers reliability in the cloud, describing best practices for these four areas: 
+  Foundations 
+  Workload Architecture 
+  Change Management 
+  Failure Management 

 To achieve reliability you must start with the foundations—an environment where service quotas and network topology accommodate the workload. The workload architecture of the distributed system must be designed to prevent and mitigate failures. The workload must handle changes in demand or requirements, and it must be designed to detect failure and automatically heal itself. 

**Topics**
+ [Resiliency, and the components of reliability](resiliency-and-the-components-of-reliability.md)
+ [Availability](availability.md)
+ [Disaster Recovery (DR) objectives](disaster-recovery-dr-objectives.md)

# Resiliency, and the components of reliability

 Reliability of a workload in the cloud depends on several factors, the primary of which is *Resiliency*: 
+  **Resiliency** is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues. 

 The other factors impacting workload reliability are: 
+  Operational Excellence, which includes automation of changes, use of playbooks to respond to failures, and Operational Readiness Reviews (ORRs) to confirm that applications are ready for production operations. 
+  Security, which includes preventing harm to data or infrastructure from malicious actors, which would impact availability. For example, encrypt backups to ensure that data is secure. 
+  Performance Efficiency, which includes designing for maximum request rates and minimizing latencies for your workload. 
+  Cost Optimization, which includes trade-offs such as whether to spend more on EC2 instances to achieve static stability, or to rely on automatic scaling when more capacity is needed. 

 Resiliency is the primary focus of this whitepaper. 

 The other four aspects are also important, and they are covered by their respective pillars of the [AWS Well-Architected Framework](https://aws.amazon.com/architecture/well-architected/). Many of the best practices here also address those aspects of reliability, but the focus is on resiliency.

# Availability

 *Availability* (also known as *service availability*) is both a commonly used metric to quantitatively measure resiliency and a target resiliency objective. 
+  **Availability** is the percentage of time that a workload is available for use. 

 *Available for use* means that it performs its agreed function successfully when required. 

 This percentage is calculated over a period of time, such as a month, year, or trailing three years. Applying the strictest possible interpretation, availability is reduced anytime that the application isn’t operating normally, including both scheduled and unscheduled interruptions. We define *availability* as follows: 

![Availability is calculated as the time the workload is available for use divided by the total time.](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/availability-formula.png)

+ Availability is a percentage uptime (such as 99.9%) over a period of time (commonly a month or year) 
+  Common shorthand refers only to the “number of nines”; for example, “five nines” translates to being 99.999% available 
+ Some customers choose to exclude scheduled service downtime (for example, planned maintenance) from the *Total Time* in the formula. However, this is not advised, as your users will likely want to use your service during these times. 
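
As a sanity check on the formula above, availability over a period can be computed directly from measured downtime. This is a minimal sketch; the function name and the numbers are illustrative, not an AWS tool:

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = available-for-use time / total time, as a percentage."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

# A 30-day month has 43,200 minutes; 45 minutes of total downtime:
month_minutes = 30 * 24 * 60
print(round(availability(month_minutes, 45), 3))  # → 99.896
```

Note that under the strict interpretation above, the 45 minutes would include scheduled maintenance as well as unplanned outages.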

 Here is a table of common application availability design goals and the maximum length of time that interruptions can occur within a year while still meeting the goal. The table contains examples of the types of applications we commonly see at each availability tier. Throughout this document, we refer to these values. 


|  Availability  |  Maximum Unavailability (per year)  |  Application Categories  | 
| --- | --- | --- | 
|  99%  |  3 days 15 hours  |  Batch processing, data extraction, transfer, and load jobs  | 
|  99.9%  |  8 hours 45 minutes  |  Internal tools like knowledge management, project tracking  | 
|  99.95%  |  4 hours 22 minutes  |  Online commerce, point of sale  | 
|  99.99%  |  52 minutes  |  Video delivery, broadcast workloads  | 
|  99.999%  |  5 minutes  |  ATM transactions, telecommunications workloads  | 
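
The downtime budgets in the table can be reproduced with a short calculation (a sketch assuming the 365-day year the table appears to use):

```python
def max_downtime_hours(availability_pct: float, hours_per_year: float = 365 * 24) -> float:
    """Maximum unavailability per year implied by an availability target."""
    return (1 - availability_pct / 100) * hours_per_year

for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{target}% -> {max_downtime_hours(target):.2f} hours/year")
```

For example, 99% yields 87.6 hours, which is 3 days and 15.6 hours, matching the table's “3 days 15 hours” after rounding down; 99.99% yields 0.876 hours, or about 52 minutes.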

**Measuring availability based on requests.** For your service, it may be easier to count successful and failed requests instead of “time available for use”. In this case, the following calculation can be used: 

![\[Mathematical formula for calculating availability using successful responses divided by valid requests.\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/availability-formula-requests.png)


This is often measured for one-minute or five-minute periods, and a monthly uptime percentage (a time-based availability measurement) is then calculated from the average of these periods. If no requests are received in a given period, it is counted as 100% available for that time. 
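
The request-based calculation, including the 100%-for-idle-periods rule, can be sketched as follows (hypothetical helper name; the data is illustrative):

```python
def request_availability(successful: int, valid_requests: int) -> float:
    """Per-period availability from request counts.

    A period that receives no valid requests counts as fully available.
    """
    if valid_requests == 0:
        return 100.0
    return 100.0 * successful / valid_requests

# Three five-minute periods: (successful, valid). The last period was idle.
periods = [(1000, 1000), (990, 1000), (0, 0)]
per_period = [request_availability(s, v) for s, v in periods]

# Averaging the periods gives a time-based availability estimate.
print(round(sum(per_period) / len(per_period), 2))  # → 99.67
```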

  

 **Calculating availability with hard dependencies.** Many systems have hard dependencies on other systems, where an interruption in a dependent system directly translates to an interruption of the invoking system. This is opposed to a soft dependency, where a failure of the dependent system is compensated for in the application. Where such hard dependencies occur, the invoking system’s availability is the product of the dependent systems’ availabilities. For example, if you have a system designed for 99.99% availability that has a hard dependency on two other independent systems that each are designed for 99.99% availability, the workload can theoretically achieve 99.97% availability: 

 Avail_invoking × Avail_dep1 × Avail_dep2 = Avail_workload 

 99.99% × 99.99% × 99.99% = 99.97% 

 It’s therefore important to understand your dependencies and their availability design goals as you calculate your own. 
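
The hard-dependency calculation is a product of availabilities, and can be sketched as (a hypothetical helper, not an AWS API):

```python
import math

def serial_availability(invoking_pct: float, *dependency_pcts: float) -> float:
    """Availability of a system with hard (serial) dependencies:
    the product of the individual availabilities."""
    fractions = [invoking_pct / 100] + [p / 100 for p in dependency_pcts]
    return 100 * math.prod(fractions)

# A 99.99% system with hard dependencies on two independent 99.99% systems:
print(round(serial_availability(99.99, 99.99, 99.99), 2))  # → 99.97
```

Each additional hard dependency can only lower the result, which is why soft dependencies (where failures are compensated for) are preferred for high-availability targets.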

 **Calculating availability with redundant components.** When a system involves the use of independent, redundant components (for example, redundant resources in different Availability Zones), the theoretical availability is computed as 100% minus the product of the component failure rates. For example, if a system makes use of two independent components, each with an availability of 99.9%, the effective availability of this dependency is 99.9999%: 

![\[Diagram showing calculation of availability with redundant components in a system.\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/image2.png)


 Avail_effective = Avail_MAX − ((100% − Avail_dependency) × (100% − Avail_dependency)) 

 99.9999% = 100% − (0.1% × 0.1%) 

 *Shortcut calculation*: If the availabilities of all components in your calculation consist solely of the digit nine, you can sum the number of nines to get your answer. In the above example, two redundant, independent components with three nines of availability result in six nines. 
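
Both the redundant-component formula and the nines shortcut can be checked with a short sketch (illustrative helper name):

```python
def parallel_availability(*component_pcts: float) -> float:
    """Availability of independent, redundant components:
    100% minus the product of the component failure rates."""
    failure = 1.0
    for pct in component_pcts:
        failure *= (100 - pct) / 100  # per-component failure rate
    return 100 * (1 - failure)

# Two independent components at three nines each -> six nines (3 + 3):
print(round(parallel_availability(99.9, 99.9), 4))  # → 99.9999
```

This assumes the components fail independently; correlated failures (for example, a shared dependency) invalidate the calculation.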

 **Calculating dependency availability.** Some dependencies provide guidance on their availability, including availability design goals for many AWS services. But in cases where this isn’t available (for example, a component where the manufacturer does not publish availability information), one way to estimate is to determine the **Mean Time Between Failure (MTBF)** and **Mean Time to Recover (MTTR)**. An availability estimate can be established by: 

![Availability is estimated as MTBF divided by the sum of MTBF and MTTR.](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/avail-est-formula.png)


 For example, if the MTBF is 150 days and the MTTR is 1 hour, the availability estimate is 99.97%. 
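
The MTBF/MTTR estimate can be sketched as (illustrative helper; the example values are taken from the text above):

```python
def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate availability as MTBF / (MTBF + MTTR), as a percentage."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

# MTBF of 150 days (in hours) and MTTR of 1 hour:
print(round(availability_from_mtbf_mttr(150 * 24, 1), 2))  # → 99.97
```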

 For additional details, see [Availability and Beyond: Understanding and improving the resilience of distributed systems on AWS](https://docs.aws.amazon.com/whitepapers/latest/availability-and-beyond-improving-resilience/availability-and-beyond-improving-resilience.html), which can help you calculate your availability. 

 **Costs for availability.** Designing applications for higher levels of availability typically results in increased cost, so it’s appropriate to identify the true availability needs before embarking on your application design. High levels of availability impose stricter requirements for testing and validation under exhaustive failure scenarios. They require automation for recovery from all manner of failures, and require that all aspects of system operations be similarly built and tested to the same standards. For example, the addition or removal of capacity, the deployment or rollback of updated software or configuration changes, or the migration of system data must be conducted to the desired availability goal. These requirements compound software development costs, and at very high levels of availability, innovation suffers because of the need to move more slowly in deploying systems. The guidance, therefore, is to be thorough in applying the standards and to consider the appropriate availability target for the entire lifecycle of operating the system. 

 Another way that costs escalate in systems that operate with higher availability design goals is in the selection of dependencies. At these higher goals, the set of software or services that can be chosen as dependencies diminishes based on which of these services have had the deep investments we previously described. As the availability design goal increases, it’s typical to find fewer multi-purpose services (such as a relational database) and more purpose-built services. This is because the latter are easier to evaluate, test, and automate, and have a reduced potential for surprise interactions with included but unused functionality. 

# Disaster Recovery (DR) objectives


 In addition to availability objectives, your resiliency strategy should also include Disaster Recovery (DR) objectives based on strategies to recover your workload in case of a disaster event. Disaster Recovery focuses on one-time recovery objectives in response to natural disasters, large-scale technical failures, or human threats such as attack or error. This is different from availability, which measures mean resiliency over a period of time in response to component failures, load spikes, or software bugs. 

 **Recovery Time Objective (RTO)** Defined by the organization. RTO is the maximum acceptable delay between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable. 

 **Recovery Point Objective (RPO)** Defined by the organization. RPO is the maximum acceptable amount of time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service. 

![\[Business continuity timeline showing RPO, disaster event, and RTO with data loss and downtime periods.\]](http://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/images/business-continuity.png)


*The relationship of RPO (Recovery Point Objective), RTO (Recovery Time Objective), and the disaster event.*

 RTO is similar to MTTR (Mean Time to Recover) in that both measure the time between the start of an outage and workload recovery. However, MTTR is a mean value taken over several availability-impacting events over a period of time, while RTO is a target, or maximum value allowed, for a *single* availability-impacting event. 
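
The distinction can be made concrete with a sketch (hypothetical outage data): MTTR averages over many events, while RTO is checked against each event individually:

```python
# Outage durations in minutes for several availability-impacting events.
outages_minutes = [12, 45, 8, 30]
rto_minutes = 35  # target maximum recovery time for any single event

mttr = sum(outages_minutes) / len(outages_minutes)  # mean over all events
breaches = [d for d in outages_minutes if d > rto_minutes]

print(f"MTTR: {mttr:.2f} minutes")  # 23.75 — the mean looks healthy
print(f"RTO breaches: {breaches}")  # [45]  — one event still missed the RTO
```

A healthy MTTR can therefore coexist with RTO breaches, which is why both measures are tracked.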

# Understanding availability needs

 It’s common to initially think of an application’s availability as a single target for the application as a whole. However, upon closer inspection, we frequently find that certain aspects of an application or service have different availability requirements. For example, some systems might prioritize the ability to receive and store new data ahead of retrieving existing data. Other systems prioritize real-time operations over operations that change a system’s configuration or environment. Services might have very high availability requirements during certain hours of the day, but can tolerate much longer periods of disruption outside of these hours. These are a few of the ways that you can decompose a single application into constituent parts, and evaluate the availability requirements for each. The benefit of doing this is to focus your efforts (and expense) on availability according to specific needs, rather than engineering the whole system to the strictest requirement. 


|  Recommendation  | 
| --- | 
|  Critically evaluate the unique aspects to your applications and, where appropriate, differentiate the availability and disaster recovery design goals to reflect the needs of your business.  | 

 Within AWS, we commonly divide services into the “data plane” and the “control plane.” The data plane is responsible for delivering real-time service, while control planes are used to configure the environment. For example, Amazon EC2 instances, Amazon RDS databases, and Amazon DynamoDB table read/write operations are all data plane operations. In contrast, launching new EC2 instances or RDS databases, or adding or changing table metadata in DynamoDB, are all control plane operations. While high levels of availability are important for all of these capabilities, the data planes typically have higher availability design goals than the control planes. Therefore, workloads with high availability requirements should avoid run-time dependency on control plane operations. 

 Many AWS customers take a similar approach to critically evaluating their applications and identifying subcomponents with different availability needs. Availability design goals are then tailored to the different aspects, and the appropriate work efforts are performed to engineer the system. AWS has significant experience engineering applications with a range of availability design goals, including services with 99.999% or greater availability. AWS Solution Architects (SAs) can help you design appropriately for your availability goals. Involving AWS early in your design process improves our ability to help you meet your availability goals. Planning for availability is not only done before your workload launches. It’s also done continuously to refine your design as you gain operational experience, learn from real world events, and endure failures of different types. You can then apply the appropriate work effort to improve upon your implementation. 

 The availability needs of a workload must be aligned to business need and criticality. By first defining a business criticality framework with defined RTO, RPO, and availability targets, you can then assess each workload against it. Such an approach requires that the people involved in implementation of the workload are knowledgeable about the framework and the impact their workload has on business needs. 