

# The ORR mechanism
<a name="the-orr-mechanism"></a>

To build a mechanism, work backwards from the challenge you want to solve. The ORR was designed by AWS to help prevent the recurrence of known, common causes of impact in services without slowing builders down. The design and operations of those services are the inputs. The outputs are what you want to achieve by resolving the business challenge. In our case at AWS, the desired business result was higher levels of availability and resilience in our systems by decreasing the frequency of incidents (*fewer*), decreasing the duration of incidents (*shorter*), and decreasing the scope of impact of an incident (*smaller*). You can start with the same business challenge and results when you create your own ORR mechanism.

The following sections examine each component of the mechanism. Each section describes the AWS approach and provides recommendations for that part of the mechanism. After you have defined your business challenge, use these sections as a guide for building the tool, driving adoption, inspecting the process, and iterating. 

**Topics**
+ [The ORR tool](the-orr-tool.md)
+ [Gaining adoption](gaining-adoption.md)
+ [Inspect the process](inspect-the-process.md)
+ [Iteration](iteration.md)

# The ORR tool
<a name="the-orr-tool"></a>

The main tool for ORR is the checklist of questions itself. AWS has built a web service around this checklist that creates templates, provides a consistent user interface (UI), links to cautionary tales, and integrates with the AWS ticketing system. This allows teams to perform a self-service review of their workload, record the results, understand their residual risk, and track action items that result from the review, which can be added directly to their backlog. 

Let’s take a look at an example of a question AWS might ask in one of those checklists. Some systems choose to implement [certificate pinning](https://docs.aws.amazon.com/acm/latest/userguide/acm-bestpractices.html#best-practices-pinning). While there are some potential security benefits, this practice poses a significant availability risk if the pinned certificate is replaced, which can occur for any number of reasons. A question and guidance in your ORR checklist for certificate pinning might look like the following. 


| **Question:** Do any of your hosts pin certificates?  | 
| --- | 
|  **Guidance:** We recommend against using certificate pinning because it introduces a potential availability risk. If the certificate to which you pin is replaced, your application will fail to connect. If your use case requires pinning, we recommend that you pin to a certificate authority (CA) rather than to an individual certificate.  ☐ Yes - High Risk  ☐ No - No Risk   | 

If you haven’t had an incident related to certificate pinning, or it’s not a high-priority item to address across your enterprise, then don’t include this question. The ORR checklist is most effective when it’s focused on incidents that present critical risks. These are the types of risks that would prevent a General Availability (GA) launch of a service. Medium and low risks are left out of the ORR to keep it a lightweight process that doesn’t overburden teams or reduce their agility and ability to innovate. 
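As an illustration only, the certificate pinning question above could be represented as data so that a completed review yields a residual risk level and a list of action items. This is a minimal sketch, not a description of the AWS ORR web service: the `Question` class, the `Risk` levels, and the `review` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    """Risk levels, ordered so that max() picks the most severe."""
    NONE = 0
    HIGH = 1


@dataclass
class Question:
    """One checklist entry: a prompt, its guidance, and the risk each answer carries."""
    prompt: str
    guidance: str
    risk_by_answer: dict[str, Risk]


# Hypothetical encoding of the certificate pinning example.
PINNING = Question(
    prompt="Do any of your hosts pin certificates?",
    guidance="Avoid pinning; if required, pin to a CA rather than an individual certificate.",
    risk_by_answer={"Yes": Risk.HIGH, "No": Risk.NONE},
)


def review(questions, answers):
    """Return (residual_risk, action_items) for a completed checklist.

    `answers` maps each question prompt to the answer the team selected.
    """
    residual = Risk.NONE
    action_items = []
    for q in questions:
        risk = q.risk_by_answer[answers[q.prompt]]
        if risk > Risk.NONE:
            # Every risky answer becomes an item the team can add to its backlog.
            action_items.append(f"Mitigate: {q.prompt}")
        residual = max(residual, risk)
    return residual, action_items
```

A team that answers "Yes" to the pinning question would end up with a high residual risk and one action item to track; a spreadsheet with the same columns would serve the same purpose.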

## Customer recommendations
<a name="customer-recommendations"></a>

To get started with an ORR program, you don’t need the same level of tooling that AWS has built. The most important component is generating the questions themselves. We recommend reviewing three categories: 
+ Real incidents that you’ve had in the past 
+ Near-misses that you’ve had in the past 
+ The failure modes that haven’t occurred, but that you’re concerned about 

From these categories, you can begin to develop questions and associated best practices that can either prevent those incidents in the future or reduce their scope of impact or duration. You can apply lessons you’ve learned in both AWS and on-premises environments; they aren’t exclusive to operating in the cloud. See [Appendix A: Creating ORR guidance from an incident](appendix-a-creating-orr-guidance-from-an-incident.md) for a complete example of how you can generate ORR guidance from an incident. See [Appendix B: Example ORR questions](appendix-b-example-orr-questions.md) for example questions that you can use to start building your own ORR checklist. Keep in mind that these are only examples, and you should tailor the checklist for your specific use cases, environment, and workloads. 

To get started building your own checklist, we suggest grouping your questions and developing content in the following areas. 
+ **Architecture** — The focus is on how you’ve built your architecture, the dependencies your workload has taken, how you scale and manage capacity, and how you protect your workload from its customers (for example, preventing overload). 
+ **Release quality** — This section focuses on how you test and deploy changes to your workload including detecting problems, automated and manual rollback procedures, and how you phase changes incrementally to your systems. 
+ **Event management** — These questions focus on the processes and procedures required to deal with an event when it does occur, including topics like paging an on-call operator, the location and coverage of runbooks, the instrumentation and alarms associated with your workload, and metrics and dashboards you use to understand your workload’s state. 

Your questions may address people, process, and technology in each area. You may also choose to organize the checklist content in more granular categories, such as the following: 
+ Deployment safety 
+ Defense against customers 
+ Defense against dependencies 
+ Data recovery 
+ Operator safety 
+ Blast radius containment 
+ Event detection 
+ Service restart 
+ Forensics 
+ Escalation 

Using [custom lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html), you can build your checklists into the [AWS Well-Architected Tool](https://aws.amazon.com/well-architected-tool/). You might decide to track action items from your ORR in a tool such as [AWS Systems Manager OpsCenter](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter.html). You might also choose to use the results of [post-incident analysis](https://docs.aws.amazon.com/incident-manager/latest/userguide/analysis.html) in [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) as inputs to developing your questions. AWS offers several different engagement models to help you build your own ORR checklist to complement what you’re doing with Well-Architected. Contact your account team for additional details. 

# Gaining adoption
<a name="gaining-adoption"></a>

While the ORR name may suggest that it is a “pre-launch” checklist, the process is actually built into the entire Software Development Lifecycle (SDLC). To be most effective, ORRs should be integrated with, and adopted across, that lifecycle. The following diagram demonstrates how AWS views the adoption spectrum for ORRs. 

![\[Diagram showing the ORR adoption and implementation lifecycle tied to the SDLC\]](http://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/images/orr-lifecycle-tied-to-sdlc.png)


Figure 2 shows that the **ORR Lifecycle for New Service and Iterations** is initiated when a new service, new feature, or architecture change is proposed. The **ORR Cycle Start** phase begins during the *Design* phase of the SDLC process, when teams start to answer the design and architecture questions. At the same time, the team has a holistic view of all ORR questions that are associated with the upcoming stages of the SDLC. During the **Mid-Cycle Check-in** phase, teams start to answer *Development* and *Testing* related questions. Finally, in the **ORR Conclusion and Follow-up** phase, the team wraps up the ORR checklist and develops their risk mitigation and follow-up plan. 

In addition to the ORR performed through the SDLC process, teams are expected to perform an ORR on their full service at least annually, using a checklist tailored to that event. This helps verify that they stay up to date with new or updated best practices and that nothing has changed within their systems. 

To support adoption of the tool across these different phases, AWS uses different checklists for different occasions and workload types. A few examples are: 
+ Launching a new public service 
+ Launching a new major feature 
+ Launching a new minor feature 
+ Recurring annual review 
+ Serverless workload 
+ User console 
+ Server agent 
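If you maintain several checklists like these, selecting the right one can be made mechanical. The following is a minimal sketch under stated assumptions: the template identifiers, the `TEMPLATES` registry, and the idea of combining an occasion with a workload type are all hypothetical and are not taken from the AWS tooling.

```python
# Hypothetical registry keyed by (occasion, workload type).
# A workload type of None marks the generic template for that occasion.
TEMPLATES = {
    ("new_public_service", None): "new-public-service",
    ("major_feature", None): "major-feature",
    ("minor_feature", None): "minor-feature",
    ("annual_review", None): "annual-review",
    ("major_feature", "serverless"): "major-feature-serverless",
}


def select_template(occasion, workload_type=None):
    """Prefer a workload-specific template; fall back to the occasion's generic one."""
    for key in ((occasion, workload_type), (occasion, None)):
        if key in TEMPLATES:
            return TEMPLATES[key]
    raise KeyError(f"no ORR template for occasion {occasion!r}")
```

Falling back to a generic template keeps the registry small while still letting you add a tailored checklist wherever a workload type produces many irrelevant questions.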

## Customer recommendations
<a name="customer-recommendations-1"></a>

Once you have established a working mental model for the tool, identify stakeholders who will be impacted by it. You might ask questions like: "Who needs to contribute to it?" or "What do they need to do in order to adopt and implement the tool?"

One way to drive adoption is to start small and expand. Find a new workload that is being built or one that is targeted for migration to AWS. Pilot the ORR process with that workload and generate lessons learned. Another approach would be to conduct ORRs on your most critical workloads first. An important lesson in driving adoption of anything is to “make it easy to do the right thing”. The simpler it is for teams to consume and use the tool as part of their day-to-day business, the easier it is to gain broad adoption. 

One of the most important factors in gaining adoption is ensuring the checklists are aligned with the outcome you want to achieve. As mentioned previously, keep the checklist questions to the minimum required to address critical risks. Verify that the checklists align with the type of workload being evaluated. Minimize the possibility of “not applicable” answers so that the process doesn’t introduce unnecessary overhead and take additional time from your builders. 
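One way to find questions that generate unnecessary overhead is to measure how often each one is answered “not applicable” across completed reviews. The sketch below assumes that each completed review is available as a dictionary of question identifier to answer and that the literal string `"N/A"` marks a not-applicable answer; the function names and the 50% threshold are illustrative choices.

```python
from collections import Counter


def not_applicable_rates(completed_reviews):
    """Return {question_id: fraction of reviews that answered "N/A"}.

    `completed_reviews` is a list of dicts mapping question id -> answer.
    """
    asked = Counter()
    skipped = Counter()
    for answers in completed_reviews:
        for question_id, answer in answers.items():
            asked[question_id] += 1
            if answer == "N/A":
                skipped[question_id] += 1
    return {qid: skipped[qid] / asked[qid] for qid in asked}


def pruning_candidates(completed_reviews, threshold=0.5):
    """Question ids answered "N/A" in at least `threshold` of reviews, sorted by id."""
    rates = not_applicable_rates(completed_reviews)
    return sorted(qid for qid, rate in rates.items() if rate >= threshold)
```

A question that appears in the result is a candidate for removal or for moving to a checklist aimed at a narrower workload type.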

# Inspect the process
<a name="inspect-the-process"></a>

AWS uses several different inspection processes for the ORR. First, AWS holds a weekly operational metrics meeting attended by thousands of engineers and leadership up to the Senior Vice President (SVP) level. Each week, different service teams present their metrics and dashboards to the group. The metrics that are inspected include when the team’s last ORR was performed and the number of outstanding action items the team has. 
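The two metrics mentioned above, the date of the last ORR and the count of outstanding action items, are straightforward to report on. The following sketch is illustrative only: the `OrrStatus` record and the `needs_attention` function are hypothetical, and the 365-day default simply mirrors the expectation of a review at least annually.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class OrrStatus:
    """Per-service inspection record (hypothetical shape)."""
    service: str
    last_orr: date
    open_action_items: int


def needs_attention(statuses, today, max_age=timedelta(days=365)):
    """Return names of services whose last ORR is older than max_age
    or that still have open action items."""
    return [
        s.service
        for s in statuses
        if today - s.last_orr > max_age or s.open_action_items > 0
    ]
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the report reproducible and easy to test.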

Second, the ORR mechanism is inspected as part of the ORR Conclusion and Follow-up phase. During this phase, teams generate a narrative to describe successes, risks, mitigations, and the tracking mechanisms to close the action items. This narrative is reviewed by senior AWS leadership for new service or major feature launches. The results of the ORR are reviewed during a scheduled meeting with an audience including the engineering team, principal engineers in their organization, leadership and management, and any stakeholders from dependencies or customers of the service. During the meeting, attendees review the completed checklist and provide feedback on the findings. Any high-criticality findings are escalated to leadership as input to a go or no-go launch decision. 

Finally, the effectiveness of the ORR mechanism is inspected during the Correction of Error (COE) process. The COE template asks questions like “When was your last ORR performed?” and “Would any ORR recommendations have reduced or avoided the impact of this event?” This helps AWS gauge whether the ORR mechanism is effective at preventing or diminishing the impact of events, or whether the mechanism should be iterated upon to make it more effective, for example, by adding or changing questions. 

## Customer recommendations
<a name="customer-recommendations-2"></a>

You’ll need to consider how you will inspect your mechanism. This is key to knowing whether the mechanism is actually helping you achieve your desired outcomes. What we find is that top-down buy-in to mechanisms not only helps drive their adoption, but also creates an inspection process that is effective at achieving the desired business results. For ORRs to be a successful program for your business, you will need to create an inspection process that has the gravity to drive both cultural change and adoption of the mechanism’s tool. 

Use multiple perspectives for inspection. You should seek diverse input from product management, IT leadership, developers, and engineers. These different perspectives will give you different insights into the effectiveness of the mechanism and how you may need to alter it. 

# Iteration
<a name="iteration"></a>

It’s unlikely that a mechanism will operate as designed from day one. It takes time to experiment with various tools and find one that works. Adoption can also take significant effort to push out to a large population. Inspection starts when the tool starts to become broadly adopted, but at each stage you may find that alterations need to be made to the mechanism. Here are a few examples of how AWS iterates on the ORR mechanism. 

First, AWS constantly seeks feedback on the ORR mechanism from its users, the AWS service teams. This drives the creation of new checklists for different occasions or different types of workloads (like a serverless application, user console, or server agent) so that each one is as pertinent as possible for its consumers. It also helps curate the guidance and questions used in each checklist template, and it drives enhancements in the user experience provided by the ORR web service. 

Another way AWS iterates on the ORR mechanism is through a specialist engineering community called “Operational Champions,” or “Ops Champions” for short. They provide two different functions for the ORR program. The first is as part of the ORR process itself. Teams engage an Ops Champion during their review. The Ops Champion challenges the team on their answers to the checklist, provides context on the adoption and prioritization of best practices, and ends up influencing everything from workload architecture to operational culture in the team. Ops Champions take part in the complete process, including the review meeting and the retrospectives held after the ORR is complete to review lessons learned. 

The second function they provide is as a working group that continues to unify and document emerging best practices from around our decentralized service teams to avoid pockets of institutional knowledge. They focus on the ORR questions to ensure that the questions ask the right thing and provide the right guidance. They also verify that results can be measured, determine risk severity, and develop solutions that make the best practices easier to adopt and implement. They review the outcomes of COEs and create new lessons learned and new best practices. There is a tight coupling between the COE process and the ORR: we use the lessons learned to continually generate new content to deal with evolving risks in distributed systems. We also use that information to ensure we prioritize the right risks for inclusion in the ORR checklists. 

## Customer recommendations
<a name="customer-recommendations-3"></a>

Quick iteration has proven to be a valuable approach for building modern distributed systems and is equally valuable in building mechanisms. Just as with the inspection process, seek diverse perspectives to create a more holistic understanding of how your mechanism might need to change. Provide opportunities for honest and, if possible, anonymous feedback on the tool and process. Find out which questions were the most useful or which problems can be solved with automation or centralized solutions. Use this feedback to improve the tool and make it easier to drive greater adoption. 

Additionally, developing a community of operationally focused specialists helps improve the effectiveness of the tool, drives further adoption of the tool, and enhances the ability to iterate on the mechanism. You will likely build your own Ops Champion community as you iterate on your ORR program. AWS Solutions Architects (SAs) and Technical Account Managers (TAMs) can help you develop this community. 