# Prepare
<a name="a-prepare"></a>

**Topics**
+ [

# OPS 4. How do you implement observability in your workload?
](ops-04.md)
+ [

# OPS 5. How do you reduce defects, ease remediation, and improve flow into production?
](ops-05.md)
+ [

# OPS 6. How do you mitigate deployment risks?
](ops-06.md)
+ [

# OPS 7. How do you know that you are ready to support a workload?
](ops-07.md)

# OPS 4. How do you implement observability in your workload?
<a name="ops-04"></a>

Implement observability in your workload so that you can understand its state and make data-driven decisions based on business requirements.

**Topics**
+ [

# OPS04-BP01 Identify key performance indicators
](ops_observability_identify_kpis.md)
+ [

# OPS04-BP02 Implement application telemetry
](ops_observability_application_telemetry.md)
+ [

# OPS04-BP03 Implement user experience telemetry
](ops_observability_customer_telemetry.md)
+ [

# OPS04-BP04 Implement dependency telemetry
](ops_observability_dependency_telemetry.md)
+ [

# OPS04-BP05 Implement distributed tracing
](ops_observability_dist_trace.md)

# OPS04-BP01 Identify key performance indicators
<a name="ops_observability_identify_kpis"></a>

 Implementing observability in your workload starts with understanding its state and making data-driven decisions based on business requirements. One of the most effective ways to ensure alignment between monitoring activities and business objectives is by defining and monitoring key performance indicators (KPIs). 

 **Desired outcome:** Efficient observability practices that are tightly aligned with business objectives, ensuring that monitoring efforts are always in service of tangible business outcomes. 

 **Common anti-patterns:** 
+  Undefined KPIs: Working without clear KPIs can lead to monitoring too much or too little, missing vital signals. 
+  Static KPIs: Not revisiting or refining KPIs as the workload or business objectives evolve. 
+  Misalignment: Focusing on technical metrics that don’t correlate directly with business outcomes or are harder to correlate with real-world issues. 

 **Benefits of establishing this best practice:** 
+  Ease of issue identification: Business KPIs often surface issues more clearly than technical metrics. A dip in a business KPI can pinpoint a problem more effectively than sifting through numerous technical metrics. 
+  Business alignment: Ensures that monitoring activities directly support business objectives. 
+  Efficiency: Prioritize monitoring resources and attention on metrics that matter. 
+  Proactivity: Recognize and address issues before they have broader business implications. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To effectively define workload KPIs: 

1.  **Start with business outcomes:** Before diving into metrics, understand the desired business outcomes. Is it increased sales, higher user engagement, or faster response times? 

1.  **Correlate technical metrics with business objectives:** Not all technical metrics have a direct impact on business outcomes. Identify those that do, but it's often more straightforward to identify an issue using a business KPI. 

1.  **Use [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html):** Employ CloudWatch to define and monitor metrics that represent your KPIs. 

1.  **Regularly review and update KPIs:** As your workload and business evolve, keep your KPIs relevant. 

1.  **Involve stakeholders:** Involve both technical and business teams in defining and reviewing KPIs. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+ [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md)
+ [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md)
+ [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md)
+ [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md)

 **Related documents:** 
+ [AWS Observability Best Practices ](https://aws-observability.github.io/observability-best-practices/)
+ [ CloudWatch User Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)
+ [AWS Observability Skill Builder Course ](https://explore.skillbuilder.aws/learn/course/external/view/elearning/14688/aws-observability)

 **Related videos:** 
+ [ Developing an observability strategy ](https://www.youtube.com/watch?v=Ub3ATriFapQ)

 **Related examples:** 
+  [One Observability Workshop](https://catalog.workshops.aws/observability/en-US) 

# OPS04-BP02 Implement application telemetry
<a name="ops_observability_application_telemetry"></a>

 Application telemetry serves as the foundation for observability of your workload. It's crucial to emit telemetry that offers actionable insights into the state of your application and the achievement of both technical and business outcomes. From troubleshooting to measuring the impact of a new feature or ensuring alignment with business key performance indicators (KPIs), application telemetry informs the way you build, operate, and evolve your workload. 

 Metrics, logs, and traces form the three primary pillars of observability. These serve as diagnostic tools that describe the state of your application. Over time, they assist in creating baselines and identifying anomalies. However, to ensure alignment between monitoring activities and business objectives, it's pivotal to define and monitor KPIs. Business KPIs often make it easier to identify issues compared to technical metrics alone. 

 Other telemetry types, like real user monitoring (RUM) and synthetic transactions, complement these primary data sources. RUM offers insights into real-time user interactions, whereas synthetic transactions simulate potential user behaviors, helping detect bottlenecks before real users encounter them. 

 **Desired outcome:** Derive actionable insights into the performance of your workload. These insights allow you to make proactive decisions about performance optimization, achieve increased workload stability, streamline CI/CD processes, and utilize resources effectively. 

 **Common anti-patterns:** 
+  **Incomplete observability:** Neglecting to incorporate observability at every layer of the workload, resulting in blind spots that can obscure vital system performance and behavior insights. 
+  **Fragmented data view:** When data is scattered across multiple tools and systems, it becomes challenging to maintain a holistic view of your workload's health and performance. 
+  **User-reported issues:** A sign that proactive issue detection through telemetry and business KPI monitoring is lacking. 

 **Benefits of establishing this best practice:** 
+  **Informed decision-making:** With insights from telemetry and business KPIs, you can make data-driven decisions. 
+  **Improved operational efficiency:** Data-driven resource utilization leads to cost-effectiveness. 
+  **Enhanced workload stability:** Faster detection and resolution of issues leading to improved uptime. 
+  **Streamlined CI/CD processes:** Insights from telemetry data facilitate refinement of processes and reliable code delivery. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To implement application telemetry for your workload, use AWS services like [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/) and [AWS X-Ray](https://aws.amazon.com/xray/). Amazon CloudWatch provides a comprehensive suite of monitoring tools, allowing you to observe your resources and applications in AWS and on-premises environments. It collects, tracks, and analyzes metrics, consolidates and monitors log data, and responds to changes in your resources, enhancing your understanding of how your workload operates. In tandem, AWS X-Ray lets you trace, analyze, and debug your applications, giving you a deep understanding of your workload's behavior. With features like service maps, latency distributions, and trace timelines, AWS X-Ray provides insights into your workload's performance and the bottlenecks affecting it. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Identify what data to collect:** Ascertain the essential metrics, logs, and traces that would offer substantial insights into your workload's health, performance, and behavior. 

1.  **Deploy the [CloudWatch agent](https://aws.amazon.com/cloudwatch/):** The CloudWatch agent is instrumental in procuring system and application metrics and logs from your workload and its underlying infrastructure. The CloudWatch agent can also be used to collect OpenTelemetry or X-Ray traces and send them to X-Ray. 

1.  **Implement anomaly detection for logs and metrics:** Use [CloudWatch Logs anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/LogsAnomalyDetection.html) and [CloudWatch Metrics anomaly detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html) to automatically identify unusual activities in your application's operations. These tools use machine learning algorithms to detect and alert on anomalies, which enanhces your monitoring capabilities and speeds up response time to potential disruptions or security threats. Set up these features to proactively manage application health and security. 

1.  **Secure sensitive log data:** Use [Amazon CloudWatch Logs data protection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/mask-sensitive-log-data.html) to mask sensitive information within your logs. This feature helps maintain privacy and compliance through automatic detection and masking of sensitive data before it is accessed. Implement data masking to securely handle and protect sensitive details such as personally identifiable information (PII). 

1.  **Define and monitor business KPIs:** Establish [custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html) that align with your [business outcomes](https://aws-observability.github.io/observability-best-practices/guides/operational/business/monitoring-for-business-outcomes/). 

1.  **Instrument your application with AWS X-Ray:** In addition to deploying the CloudWatch agent, it's crucial to [instrument your application](https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html) to emit trace data. This process can provide further insights into your workload's behavior and performance. 

1.  **Standardize data collection across your application:** Standardize data collection practices across your entire application. Uniformity aids in correlating and analyzing data, providing a comprehensive view of your application's behavior. 

1.  **Implement cross-account observability:** Enhance monitoring efficiency across multiple AWS accounts with [Amazon CloudWatch cross-account observability](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Unified-Cross-Account.html). With this feature, you can consolidate metrics, logs, and alarms from different accounts into a single view, which simplifies management and improves response times for identified issues across your organization's AWS environment. 

1.  **Analyze and act on the data:** Once data collection and normalization are in place, use [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/features/) for metrics and logs analysis, and [AWS X-Ray](https://aws.amazon.com/xray/features/) for trace analysis. Such analysis can yield crucial insights into your workload's health, performance, and behavior, guiding your decision-making process. 

 **Level of effort for the implementation plan:** High 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Define workload KPIs](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_identify_kpis.html) 
+  [OPS04-BP03 Implement user activity telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_customer_telemetry.html) 
+  [OPS04-BP04 Implement dependency telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_dependency_telemetry.html) 
+  [OPS04-BP05 Implement transaction traceability](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_dist_trace.html) 

 **Related documents:** 
+  [AWS Observability Best Practices](https://aws-observability.github.io/observability-best-practices/) 
+  [CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html) 
+  [AWS X-Ray Developer Guide](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 
+  [Instrumenting distributed systems for operational visibility](https://aws.amazon.com/builders-library/instrumenting-distributed-systems-for-operational-visibility) 
+  [AWS Observability Skill Builder Course](https://explore.skillbuilder.aws/learn/course/external/view/elearning/14688/aws-observability) 
+  [What's New with Amazon CloudWatch](https://aws.amazon.com/about-aws/whats-new/management-and-governance/?whats-new-content.sort-by=item.additionalFields.postDateTime&whats-new-content.sort-order=desc&awsf.whats-new-products=general-products%23amazon-cloudwatch) 
+  [What's new with AWS X-Ray](https://aws.amazon.com/about-aws/whats-new/developer-tools/?whats-new-content.sort-by=item.additionalFields.postDateTime&whats-new-content.sort-order=desc&awsf.whats-new-products=general-products%23aws-x-ray) 

 **Related videos:** 
+  [AWS re:Invent 2022 - Observability best practices at Amazon](https://youtu.be/zZPzXEBW4P8) 
+  [AWS re:Invent 2022 - Developing an observability strategy](https://youtu.be/Ub3ATriFapQ) 

 **Related examples:** 
+  [One Observability Workshop](https://catalog.workshops.aws/observability) 
+  [AWS Solutions Library: Application Monitoring with Amazon CloudWatch](https://aws.amazon.com/solutions/implementations/application-monitoring-with-cloudwatch) 

# OPS04-BP03 Implement user experience telemetry
<a name="ops_observability_customer_telemetry"></a>

 Gaining deep insights into customer experiences and interactions with your application is crucial. Real user monitoring (RUM) and synthetic transactions serve as powerful tools for this purpose. RUM provides data about real user interactions granting an unfiltered perspective of user satisfaction, while synthetic transactions simulate user interactions, helping in detecting potential issues even before they impact real users. 

 **Desired outcome:** A holistic view of the customer experience, proactive detection of issues, and optimization of user interactions to deliver seamless digital experiences. 

 **Common anti-patterns:** 
+  Applications without real user monitoring (RUM): 
  +  Delayed issue detection: Without RUM, you might not become aware of performance bottlenecks or issues until users complain. This reactive approach can lead to customer dissatisfaction. 
  +  Lack of user experience insights: Not using RUM means you lose out on crucial data that shows how real users interact with your application, limiting your ability to optimize the user experience. 
+  Applications without synthetic transactions: 
  +  Missed edge cases: Synthetic transactions help you test paths and functions that might not be frequently used by typical users but are critical to certain business functions. Without them, these paths could malfunction and go unnoticed. 
  +  Checking for issues when the application is not being used: Regular synthetic testing can simulate times when real users aren't actively interacting with your application, ensuring the system always functions correctly. 

 **Benefits of establishing this best practice:** 
+  Proactive issue detection: Identify and address potential issues before they impact real users. 
+  Optimized user experience: Continuous feedback from RUM aids in refining and enhancing the overall user experience. 
+  Insights on device and browser performance: Understand how your application performs across various devices and browsers, enabling further optimization. 
+  Validated business workflows: Regular synthetic transactions ensure that core functionalities and critical paths remain operational and efficient. 
+  Enhanced application performance: Leverage insights gathered from real user data to improve application responsiveness and reliability. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 To leverage RUM and synthetic transactions for user activity telemetry, AWS offers services like [Amazon CloudWatch RUM](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) and [Amazon CloudWatch Synthetics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html). Metrics, logs, and traces, coupled with user activity data, provide a comprehensive view of both the application's operational state and the user experience. 

### Implementation steps
<a name="implementation-steps"></a>

1.  **Deploy Amazon CloudWatch RUM:** Integrate your application with CloudWatch RUM to collect, analyze, and present real user data. 

   1.  Use the [CloudWatch RUM JavaScript library](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) to integrate RUM with your application. 

   1.  Set up dashboards to visualize and monitor real user data. 

1.  **Configure CloudWatch Synthetics:** Create canaries, or scripted routines, that simulate user interactions with your application. 

   1.  Define critical application workflows and paths. 

   1.  Design canaries using [CloudWatch Synthetics scripts](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html) to simulate user interactions for these paths. 

   1.  Schedule and monitor canaries to run at specified intervals, ensuring consistent performance checks. 

1.  **Analyze and act on data:** Utilize data from RUM and synthetic transactions to gain insights and take corrective measures when anomalies are detected. Use CloudWatch dashboards and alarms to stay informed. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 
+  [OPS04-BP05 Implement distributed tracing](ops_observability_dist_trace.md) 

 **Related documents:** 
+ [ Amazon CloudWatch RUM Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html)
+ [ Amazon CloudWatch Synthetics Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html)

 **Related videos:** 
+ [ Optimize applications through end user insights with Amazon CloudWatch RUM ](https://www.youtube.com/watch?v=NMaeujY9A9Y)
+ [AWS on Air ft. Real-User Monitoring for Amazon CloudWatch ](https://www.youtube.com/watch?v=r6wFtozsiVE)

 **Related examples:** 
+ [ One Observability Workshop ](https://catalog.workshops.aws/observability/en-US/intro)
+ [ Git Repository for Amazon CloudWatch RUM Web Client ](https://github.com/aws-observability/aws-rum-web)
+ [ Using Amazon CloudWatch Synthetics to measure page load time ](https://github.com/aws-samples/amazon-cloudwatch-synthetics-page-performance)

# OPS04-BP04 Implement dependency telemetry
<a name="ops_observability_dependency_telemetry"></a>

 Dependency telemetry is essential for monitoring the health and performance of the external services and components your workload relies on. It provides valuable insights into reachability, timeouts, and other critical events related to dependencies such as DNS, databases, or third-party APIs. When you instrument your application to emit metrics, logs, and traces about these dependencies, you gain a clearer understanding of potential bottlenecks, performance issues, or failures that might impact your workload. 

 **Desired outcome:** Ensure that the dependencies your workload relies on are performing as expected, allowing you to proactively address issues and ensure optimal workload performance. 

 **Common anti-patterns:** 
+  **Overlooking external dependencies:** Focusing only on internal application metrics while neglecting metrics related to external dependencies. 
+  **Lack of proactive monitoring:** Waiting for issues to arise instead of continuously monitoring dependency health and performance. 
+  **Siloed monitoring:** Using multiple, disparate monitoring tools which can result in fragmented and inconsistent views of dependency health. 

 **Benefits of establishing this best practice:** 
+  **Improved workload reliability:** By ensuring that external dependencies are consistently available and performing optimally. 
+  **Faster issue detection and resolution:** Proactively identifying and addressing issues with dependencies before they impact the workload. 
+  **Comprehensive view:** Gaining a holistic view of both internal and external components that influence workload health. 
+  **Enhanced workload scalability:** By understanding the scalability limits and performance characteristics of external dependencies. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implement dependency telemetry by starting with identifying the services, infrastructure, and processes that your workload depends on. Quantify what good conditions look like when those dependencies are functioning as expected, and then determine what data will be needed to measure those. With that information you can craft dashboards and alerts that provide insights to your operations teams on the state of those dependencies. Use AWS tools to discover and quantify the impacts when dependencies cannot deliver as needed. Continually revisit your strategy to account for changes in priorities, goals, and gained insights. 

### Implementation steps
<a name="implementation-steps"></a>

 To implement dependency telemetry effectively: 

1.  **Identify external dependencies:** Collaborate with stakeholders to pinpoint the external dependencies your workload relies on. External dependencies can encompass services like external databases, third-party APIs, network connectivity routes to other environments, and DNS services. The first step towards effective dependency telemetry is being comprehensive in understanding what those dependencies are. 

1.  **Develop a monitoring strategy:** Once you have a clear picture of your external dependencies, architect a monitoring strategy tailored to them. This involves understanding the criticality of each dependency, its expected behavior, and any associated service-level agreements or targets (SLA or SLTs). Set up proactive alerts to notify you of status changes or performance deviations. 

1.  **Use [network monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Network-Monitoring-Sections.html):** Use [Internet Monitor](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html) and [Network Monitor](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/what-is-network-monitor.html), which provide comprehensive insights into global internet and network conditions. These tools help you understand and respond to outages, disruptions, or performance degradations that affect your external dependencies. 

1.  **Stay informed with [AWS Health](https://aws.amazon.com/premiumsupport/technology/aws-health/):** AWS Health is the authoritative source of information about the health of your AWS Cloud resources. Use AWS Health to visualize and receive notifications about any current service events and upcoming changes, such as planned lifecycle events, so you can take steps to mitigate impacts. 

   1.  [Create purpose-fit AWS Health event notifications](https://docs.aws.amazon.com/health/latest/ug/user-notifications.html) to e-mail and chat channels through [AWS User Notifications](https://docs.aws.amazon.com/notifications/latest/userguide/what-is-service.html), and integrate programatically with [your monitoring and alerting tools through Amazon EventBridge](https://docs.aws.amazon.com/health/latest/ug/cloudwatch-events-health.html) or the [AWS Health API](https://docs.aws.amazon.com/health/latest/APIReference/Welcome.html). 

   1.  Plan and track progress on health events that require action by integrating with change management or ITSM tools (like [Jira](https://docs.aws.amazon.com/smc/latest/ag/cloud-sys-health.html) or [ServiceNow](https://docs.aws.amazon.com/smc/latest/ag/sn-aws-health.html)) that you may already use through Amazon EventBridge or the AWS Health API. 

   1.  If you use AWS Organizations, enable [organization view for AWS Health](https://docs.aws.amazon.com/health/latest/ug/aggregate-events.html) to aggregate AWS Health events across accounts. 

1.  **Instrument your application with [AWS X-Ray](https://aws.amazon.com/xray/):** AWS X-Ray provides insights into how applications and their underlying dependencies are performing. By tracing requests from start to end, you can identify bottlenecks or failures in the external services or components your application relies on. 

1.  **Use [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/):** This machine learning-driven service identifies operational issues, predicts when critical issues might occur, and recommends specific actions to take. It's invaluable for gaining insights into dependencies and ensuring they're not the source of operational problems. 

1.  **Monitor regularly:** Continually monitor metrics and logs related to external dependencies. Set up alerts for unexpected behavior or degraded performance. 

1.  **Validate after changes:** Whenever there's an update or change in any of the external dependencies, validate their performance and check their alignment with your application's requirements. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Define workload KPIs](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_identify_kpis.html) 
+  [OPS04-BP02 Implement application telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_application_telemetry.html) 
+  [OPS04-BP03 Implement user activity telemetry](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_customer_telemetry.html) 
+  [OPS04-BP05 Implement transaction traceability](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_observability_dist_trace.html) 
+  [OP08-BP04 Create actionable alerts](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_workload_observability_create_alerts.html) 

 **Related documents:** 
+  [Amazon Personal Health Dashboard User Guide](https://docs.aws.amazon.com/health/latest/ug/what-is-aws-health.html) 
+  [AWS Internet Monitor User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-InternetMonitor.html) 
+  [AWS X-Ray Developer Guide](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html) 
+  [AWS DevOps Guru User Guide](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html) 

 **Related videos:** 
+  [Visibility into how internet issues impact app performance](https://www.youtube.com/watch?v=Kuc_SG_aBgQ) 
+  [Introduction to Amazon DevOps Guru](https://www.youtube.com/watch?v=2uA8q-8mTZY) 
+  [Manage resource lifecycle events at scale with AWS Health](https://www.youtube.com/watch?v=VoLLNL5j9NA) 

 **Related examples:** 
+  [AWS Health Aware](https://github.com/aws-samples/aws-health-aware/) 
+  [Using Tag-Based Filtering to Manage AWS Health Monitoring and Alerting at Scale](https://aws.amazon.com/blogs/mt/using-tag-based-filtering-to-manage-health-monitoring-and-alerting-at-scale/) 

# OPS04-BP05 Implement distributed tracing
<a name="ops_observability_dist_trace"></a>

 Distributed tracing offers a way to monitor and visualize requests as they traverse through various components of a distributed system. By capturing trace data from multiple sources and analyzing it in a unified view, teams can better understand how requests flow, where bottlenecks exist, and where optimization efforts should focus. 

 **Desired outcome:** Achieve a holistic view of requests flowing through your distributed system, allowing for precise debugging, optimized performance, and improved user experiences. 

 **Common anti-patterns:** 
+  Inconsistent instrumentation: Not all services in a distributed system are instrumented for tracing. 
+  Ignoring latency: Only focusing on errors and not considering the latency or gradual performance degradations. 

 **Benefits of establishing this best practice:** 
+ Comprehensive system overview: Visualizing the entire path of requests, from entry to exit.
+  Enhanced debugging: Quickly identifying where failures or performance issues occur. 
+  Improved user experience: Monitoring and optimizing based on actual user data, ensuring the system meets real-world demands. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Begin by identifying all of the elements of your workload that require instrumentation. Once all components are accounted for, leverage tools such as AWS X-Ray and OpenTelemetry to gather trace data for analysis with tools like X-Ray and Amazon CloudWatch ServiceLens Map. Engage in regular reviews with developers, and supplement these discussions with tools like Amazon DevOps Guru, X-Ray Analytics and X-Ray Insights to help uncover deeper findings. Establish alerts from trace data to notify when outcomes, as defined in the workload monitoring plan, are at risk. 

### Implementation steps
<a name="implementation-steps"></a>

 To implement distributed tracing effectively: 

1.  **Adopt [AWS X-Ray](https://aws.amazon.com/xray/):** Integrate X-Ray into your application to gain insights into its behavior, understand its performance, and pinpoint bottlenecks. Utilize X-Ray Insights for automatic trace analysis. 

1.  **Instrument your services:** Verify that every service, from an [AWS Lambda](https://aws.amazon.com/lambda/) function to an [EC2 instance](https://aws.amazon.com/ec2/), sends trace data. The more services you instrument, the clearer the end-to-end view. 

1.  **Incorporate [CloudWatch Real User Monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-RUM.html) and [synthetic monitoring](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Synthetics_Canaries.html):** Integrate Real User Monitoring (RUM) and synthetic monitoring with X-Ray. This allows for capturing real-world user experiences and simulating user interactions to identify potential issues. 

1.  **Use the [CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html):** The agent can send traces from either X-Ray or OpenTelemetry, enhancing the depth of insights obtained. 

1.  **Use [Amazon DevOps Guru](https://aws.amazon.com/devops-guru/):** DevOps Guru uses data from X-Ray, CloudWatch, AWS Config, and AWS CloudTrail to provide actionable recommendations. 

1.  **Analyze traces:** Regularly review the trace data to discern patterns, anomalies, or bottlenecks that might impact your application's performance. 

1.  **Set up alerts:** Configure alarms in [CloudWatch](https://aws.amazon.com/cloudwatch/) for unusual patterns or extended latencies, allowing proactive issue addressing. 

1.  **Continuous improvement:** Revisit your tracing strategy as services are added or modified to capture all relevant data points. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS04-BP01 Identify key performance indicators](ops_observability_identify_kpis.md) 
+  [OPS04-BP02 Implement application telemetry](ops_observability_application_telemetry.md) 
+  [OPS04-BP03 Implement user experience telemetry](ops_observability_customer_telemetry.md) 
+  [OPS04-BP04 Implement dependency telemetry](ops_observability_dependency_telemetry.md) 

 **Related documents:** 
+ [AWS X-Ray Developer Guide ](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html)
+ [ Amazon CloudWatch agent User Guide ](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html)
+ [ Amazon DevOps Guru User Guide ](https://docs.aws.amazon.com/devops-guru/latest/userguide/welcome.html)

 **Related videos:** 
+ [ Use AWS X-Ray Insights ](https://www.youtube.com/watch?v=tl8OWHl6jxw)
+ [AWS on Air ft. Observability: Amazon CloudWatch and AWS X-Ray](https://www.youtube.com/watch?v=qBDBnPkZ-KI)

 **Related examples:** 
+ [ Instrumenting your application for AWS X-Ray](https://aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html)

# OPS 5. How do you reduce defects, ease remediation, and improve flow into production?
<a name="ops-05"></a>

 Adopt approaches that improve flow of changes into production, that activate refactoring, fast feedback on quality, and bug fixing. These accelerate beneficial changes entering production, limit issues deployed, and achieve rapid identification and remediation of issues introduced through deployment activities. 

**Topics**
+ [

# OPS05-BP01 Use version control
](ops_dev_integ_version_control.md)
+ [

# OPS05-BP02 Test and validate changes
](ops_dev_integ_test_val_chg.md)
+ [

# OPS05-BP03 Use configuration management systems
](ops_dev_integ_conf_mgmt_sys.md)
+ [

# OPS05-BP04 Use build and deployment management systems
](ops_dev_integ_build_mgmt_sys.md)
+ [

# OPS05-BP05 Perform patch management
](ops_dev_integ_patch_mgmt.md)
+ [

# OPS05-BP06 Share design standards
](ops_dev_integ_share_design_stds.md)
+ [

# OPS05-BP07 Implement practices to improve code quality
](ops_dev_integ_code_quality.md)
+ [

# OPS05-BP08 Use multiple environments
](ops_dev_integ_multi_env.md)
+ [

# OPS05-BP09 Make frequent, small, reversible changes
](ops_dev_integ_freq_sm_rev_chg.md)
+ [

# OPS05-BP10 Fully automate integration and deployment
](ops_dev_integ_auto_integ_deploy.md)

# OPS05-BP01 Use version control
<a name="ops_dev_integ_version_control"></a>

 Use version control to activate tracking of changes and releases. 

 Many AWS services offer version control capabilities. Use a revision or [source control](https://aws.amazon.com/devops/source-control/) system such as [Git](https://aws.amazon.com/devops/source-control/git/) to manage code and other artifacts such as version-controlled [AWS CloudFormation](https://aws.amazon.com/cloudformation/) templates of your infrastructure. 

 **Desired outcome:** Your teams collaborate on code. When merged, the code is consistent and no changes are lost. Errors are easily reverted through correct versioning. 

 **Common anti-patterns:** 
+  You have been developing and storing your code on your workstation. You have had an unrecoverable storage failure on the workstation and your code is lost. 
+  After overwriting the existing code with your changes, you restart your application and it is no longer operable. You are unable to revert the change. 
+  You have a write lock on a report file that someone else needs to edit. They contact you asking that you stop work on it so that they can complete their tasks. 
+  Your research team has been working on a detailed analysis that shapes your future work. Someone has accidentally saved their shopping list over the final report. You are unable to revert the change and have to recreate the report. 

 **Benefits of establishing this best practice:** By using version control capabilities you can easily revert to known good states and previous versions, and limit the risk of assets being lost. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Maintain assets in version controlled repositories. Doing so supports tracking changes, deploying new versions, detecting changes to existing versions, and reverting to prior versions (for example, rolling back to a known good state in the event of a failure). Integrate the version control capabilities of your configuration management systems into your procedures. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md) 

 **Related videos:** 
+ [AWS re:Invent 2023 - How Lockheed Martin builds software faster, powered by DevSecOps ](https://www.youtube.com/watch?v=Q1OSyxYkl5w)
+ [AWS re:Invent 2023 - How GitHub operationalizes AI for team collaboration and productivity ](https://www.youtube.com/watch?v=cOVvGaiusOI)

# OPS05-BP02 Test and validate changes
<a name="ops_dev_integ_test_val_chg"></a>

 Every change deployed must be tested to avoid errors in production. This best practice is focused on testing changes from version control to artifact build. Besides application code changes, testing should include infrastructure, configuration, security controls, and operations procedures. Testing takes many forms, from unit tests to software component analysis (SCA). Move tests further to the left in the software integration and delivery process results in higher certainty of artifact quality. 

 Your organization must develop testing standards for all software artifacts. Automated tests reduce toil and avoid manual test errors. Manual tests may be necessary in some cases. Developers must have access to automated test results to create feedback loops that improve software quality. 

 **Desired outcome:** Your software changes are tested before they are delivered. Developers have access to test results and validations. Your organization has a testing standard that applies to all software changes. 

 **Common anti-patterns:** 
+  You deploy a new software change without any tests. It fails to run in production, which leads to an outage. 
+  New security groups are deployed with AWS CloudFormation without being tested in a pre-production environment. The security groups make your app unreachable for your customers. 
+  A method is modified but there are no unit tests. The software fails when it is deployed to production. 

 **Benefits of establishing this best practice:** Change fail rate of software deployments are reduced. Software quality is improved. Developers have increased awareness on the viability of their code. Security policies can be rolled out with confidence to support organization's compliance. Infrastructure changes such as automatic scaling policy updates are tested in advance to meet traffic needs. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Testing is done on all changes, from application code to infrastructure, as part of your continuous integration practice. Test results are published so that developers have fast feedback. Your organization has a testing standard that all changes must pass. 

 Use the power of generative AI with Amazon Q Developer to improve developer productivity and code quality. Amazon Q Developer includes generation of code suggestions (based on large language models), production of unit tests (including boundary conditions), and code security enhancements through detection and remediation of security vulnerabilities. 

 **Customer example** 

 As part of their continuous integration pipeline, AnyCompany Retail conducts several types of tests on all software artifacts. They practice test driven development so all software has unit tests. Once the artifact is built, they run end-to-end tests. After this first round of tests is complete, they run a static application security scan, which looks for known vulnerabilities. Developers receive messages as each testing gate is passed. Once all tests are complete, the software artifact is stored in an artifact repository. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Work with stakeholders in your organization to develop a testing standard for software artifacts. What standard tests should all artifacts pass? Are there compliance or governance requirements that must be included in the test coverage? Do you need to conduct code quality tests? When tests complete, who needs to know? 

   1.  The [AWS Deployment Pipeline Reference Architecture](https://pipelines.devops.aws.dev/) contains an authoritative list of types of tests that can be conducted on software artifacts as part of an integration pipeline. 

1.  Instrument your application with the necessary tests based on your software testing standard. Each set of tests should complete in under ten minutes. Tests should run as part of an integration pipeline. 

   1.  Use [Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/what-is.html), a generative AI tool that can help create unit test cases (including boundary conditions), generate functions using code and comments, and implement well-known algorithms. 

   1.  Use [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) to test your application code for defects. 

   1.  You can use [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) to conduct tests on software artifacts. 

   1.  [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) can orchestrate your software tests into a pipeline. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP01 Use version control](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_version_control.html) 
+  [OPS05-BP06 Share design standards](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_share_design_stds.html) 
+  [OPS05-BP07 Implement practices to improve code quality](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_code_quality.html) 
+  [OPS05-BP10 Fully automate integration and deployment](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_auto_integ_deploy.html) 

 **Related documents:** 
+  [Adopt a test-driven development approach](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html) 
+  [Accelerate your Software Development Lifecycle with Amazon Q](https://aws.amazon.com/blogs/devops/accelerate-your-software-development-lifecycle-with-amazon-q/) 
+  [Amazon Q Developer, now generally available, includes previews of new capabilities to reimagine developer experience](https://aws.amazon.com/blogs/aws/amazon-q-developer-now-generally-available-includes-new-capabilities-to-reimagine-developer-experience/) 
+  [The Ultimate Cheat Sheet for Using Amazon Q Developer in Your IDE](https://community.aws/content/2eYoqeFRqaVnk900emsknDfzhfW/the-ultimate-cheat-sheet-for-using-amazon-q-developer-in-your-ide) 
+  [Shift-Left Workload, leveraging AI for Test Creation](https://community.aws/content/2gBZtC94gPzaCQRnt4P0rIYWuBx/shift-left-workload-leveraging-ai-for-test-creation) 
+  [Amazon Q Developer Center](https://aws.amazon.com/developer/generative-ai/amazon-q/) 
+  [10 ways to build applications faster with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/10-ways-to-build-applications-faster-with-amazon-codewhisperer/) 
+  [Looking beyond code coverage with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/looking-beyond-code-coverage-with-amazon-codewhisperer/) 
+  [Best Practices for Prompt Engineering with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/best-practices-for-prompt-engineering-with-amazon-codewhisperer/) 
+  [Automated AWS CloudFormation Testing Pipeline with TaskCat and CodePipeline](https://aws.amazon.com/blogs/devops/automated-cloudformation-testing-pipeline-with-taskcat-and-codepipeline/) 
+  [Building end-to-end AWS DevSecOps CI/CD pipeline with open source SCA, SAST, and DAST tools](https://aws.amazon.com/blogs/devops/building-end-to-end-aws-devsecops-ci-cd-pipeline-with-open-source-sca-sast-and-dast-tools/) 
+  [Getting started with testing serverless applications](https://aws.amazon.com/blogs/compute/getting-started-with-testing-serverless-applications/) 
+  [My CI/CD pipeline is my release captain](https://aws.amazon.com/builders-library/cicd-pipeline/) 
+  [Practicing Continuous Integration and Continuous Delivery on AWS Whitepaper](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/welcome.html) 

 **Related videos:** 
+  [Implement an API with Amazon Q Developer Agent for Software Development](https://www.youtube.com/watch?v=U4XEvJUvff4) 
+  [Installing, Configuring, & Using Amazon Q Developer with JetBrains IDEs (How-to)](https://www.youtube.com/watch?v=-iQfIhTA4J0) 
+  [Mastering the art of Amazon CodeWhisperer - YouTube playlist](https://www.youtube.com/playlist?list=PLDqi6CuDzubxzL-yIqgQb9UbbceYdKhpK) 
+  [AWS re:Invent 2020: Testable infrastructure: Integration testing on AWS](https://www.youtube.com/watch?v=KJC380Juo2w) 
+  [AWS Summit ANZ 2021 - Driving a test-first strategy with CDK and test driven development](https://www.youtube.com/watch?v=1R7G_wcyd3s) 
+  [Testing Your Infrastructure as Code with AWS CDK](https://www.youtube.com/watch?v=fWtuwGSoSOU) 

 **Related resources:** 
+  [AWS Deployment Pipeline Reference Architecture - Application](https://pipelines.devops.aws.dev/application-pipeline/index.html) 
+  [AWS Kubernetes DevSecOps Pipeline](https://github.com/aws-samples/devsecops-cicd-containers) 
+  [Run unit tests for a Node.js application from GitHub by using AWS CodeBuild](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/run-unit-tests-for-a-node-js-application-from-github-by-using-aws-codebuild.html) 
+  [Use Serverspec for test-driven development of infrastructure code](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/use-serverspec-for-test-driven-development-of-infrastructure-code.html) 

 **Related services:** 
+  [Amazon Q Developer](https://aws.amazon.com/q/developer/) 
+  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) 
+  [AWS CodeBuild](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/welcome.html) 

# OPS05-BP03 Use configuration management systems
<a name="ops_dev_integ_conf_mgmt_sys"></a>

 Use configuration management systems to make and track configuration changes. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes. 

Static configuration management sets values when initializing a resource that are expected to remain consistent throughout the resource’s lifetime. Dynamic configuration management sets values at initialization that can or are expected to change during the lifetime of a resource. For example, you could set a feature toggle to activate functionality in your code through a configuration change, or change the level of log detail during an incident.

Configurations should be deployed in a known and consistent state. You should use automated inspection to continually monitor resource configurations across environments and regions. These controls should be defined as code and management automated to ensure rules are consistently appplied across environments. Changes to configurations should be updated through agreed change control procedures and applied consistently, honoring version control. Application configuration should be managed independently of application and infrastructure code. This allows for consistent deployment across multiple environments. Configuration changes do not result in rebuilding or redeploying the application. 

 **Desired outcome:** You configure, validate, and deploy as part of your continuous integration, continuous delivery (CI/CD) pipeline. You monitor to validate configurations are correct. This minimizes any impact to end users and customers. 

 **Common anti-patterns:** 
+  You manually update the web server configuration across your fleet and a number of servers become unresponsive due to update errors. 
+  You manually update your application server fleet over the course of many hours. The inconsistency in configuration during the change causes unexpected behaviors. 
+  Someone has updated your security groups and your web servers are no longer accessible. Without knowledge of what was changed you spend significant time investigating the issue extending your time to recovery. 
+  You push a pre-production configuration into production through CI/CD without validation. You expose users and customers to incorrect data and services. 

 **Benefits of establishing this best practice:** Adopting configuration management systems reduces the level of effort to make and track changes, and the frequency of errors caused by manual procedures. Configuration management systems provide assurances with regards to governance, compliance, and regulatory requirements. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Configuration management systems are used to track and implement changes to application and environment configurations. Configuration management systems are also used to reduce errors caused by manual processes, make configuration changes repeatable and auditable, and reduce the level of effort. 

 On AWS, you can use [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) to continually monitor your AWS resource configurations [across accounts and Regions](https://docs.aws.amazon.com/config/latest/developerguide/aggregate-data.html). It helps you to track their configuration history, understand how a configuration change would affect other resources, and audit them against expected or desired configurations using [AWS Config Rules](https://docs.aws.amazon.com/config/latest/developerguide/evaluate-config.html) and [AWS Config Conformance Packs](https://docs.aws.amazon.com/config/latest/developerguide/conformance-packs.html). 

 For dynamic configurations in your applications running on Amazon EC2 instances, AWS Lambda, containers, mobile applications, or IoT devices, you can use [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) to configure, validate, deploy, and monitor them across your environments. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Identify configuration owners. 

   1.  Make configurations owners aware of any compliance, governance, or regulatory needs. 

1.  Identify configuration items and deliverables. 

   1.  Configuration items are all application and environmental configurations affected by a deployment within your CI/CD pipeline. 

   1.  Deliverables include success criteria, validation, and what to monitor. 

1.  Select tools for configuration management based on your business requirements and delivery pipeline. 

1.  Consider weighted deployments such as canary deployments for significant configuration changes to minimize the impact of incorrect configurations. 

1.  Integrate your configuration management into your CI/CD pipeline. 

1.  Validate all changes pushed. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) 
+  [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md) 
+  [OPS06-BP03 Employ safe deployment strategies](ops_mit_deploy_risks_deploy_mgmt_sys.md) 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html)
+ [AWS Landing Zone Accelerator ](https://aws.amazon.com/solutions/implementations/landing-zone-accelerator-on-aws/)
+ [AWS Config](https://aws.amazon.com/config/)
+ [ What is AWS Config? ](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html)
+  [AWS AppConfig](https://docs.aws.amazon.com/appconfig/latest/userguide/what-is-appconfig.html) 
+ [ What is AWS CloudFormation? ](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html)
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+ [AWS CodeBuild](https://aws.amazon.com/codebuild/)
+ [AWS CodePipeline](https://aws.amazon.com/codepipeline/)
+ [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)

 **Related videos:** 
+ [AWS re:Invent 2022 - Proactive governance and compliance for AWS workloads ](https://youtu.be/PpUnH9Y52X0?si=82wff87KHXcc6nbT)
+ [AWS re:Invent 2020: Achieve compliance as code using AWS Config](https://youtu.be/m8vTwvbzOfw?si=my4DP0FLq1zwKjho)
+ [ Manage and Deploy Application Configurations with AWS AppConfig](https://youtu.be/ztIxMY3IIu0?si=ovYGsxWOBysyQrg0)

# OPS05-BP04 Use build and deployment management systems
<a name="ops_dev_integ_build_mgmt_sys"></a>

 Use build and deployment management systems. These systems reduce errors caused by manual processes and reduce the level of effort to deploy changes. 

 In AWS, you can build continuous integration/continuous deployment (CI/CD) pipelines using services such as [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) (for example, [AWS CodeBuild](https://aws.amazon.com/codebuild/), [AWS CodePipeline](https://aws.amazon.com/codepipeline/), and [AWS CodeDeploy](https://aws.amazon.com/codedeploy/)). 

 **Desired outcome:** Your build and deployment management systems support your organization's continuous integration continuous delivery (CI/CD) system that provide capabilities for automating safe rollouts with the correct configurations. 

 **Common anti-patterns:** 
+  After compiling your code on your development system, you copy the executable onto your production systems and it fails to start. The local log files indicates that it has failed due to missing dependencies. 
+  You successfully build your application with new features in your development environment and provide the code to quality assurance (QA). It fails QA because it is missing static assets. 
+  On Friday, after much effort, you successfully built your application manually in your development environment including your newly coded features. On Monday, you are unable to repeat the steps that allowed you to successfully build your application. 
+  You perform the tests you have created for your new release. Then you spend the next week setting up a test environment and performing all the existing integration tests followed by the performance tests. The new code has an unacceptable performance impact and must be redeveloped and then retested. 

 **Benefits of establishing this best practice:** By providing mechanisms to manage build and deployment activities you reduce the level of effort to perform repetitive tasks, free your team members to focus on their high value creative tasks, and limit the introduction of error from manual procedures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Build and deployment management systems are used to track and implement change, reduce errors caused by manual processes, and reduce the level of effort required for safe deployments. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, decreases cost, encourages increased frequency of change, reduces the level of effort, and increases collaboration. 

### Implementation steps
<a name="implementation-steps"></a>

![\[Diagram showing a CI/CD pipeline using AWS CodePipeline and related services\]](http://docs.aws.amazon.com/wellarchitected/latest/framework/images/deployment-pipeline-tooling.png)


1.  Use a version control system to store and manage assets (such as documents, source code, and binary files). 

1.  Use CodeBuild to compile your source code, runs unit tests, and produces artifacts that are ready to deploy. 

1.  Use CodeDeploy as a deployment service that automates application deployments to [Amazon EC2](https://aws.amazon.com/ec2/) instances, on-premises instances, [serverless AWS Lambda functions](https://docs.aws.amazon.com/lambda/latest/dg/welcome.html), or [Amazon ECS](https://aws.amazon.com/ecs/). 

1.  Monitor your deployments. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+  [AWS Developer Tools](https://aws.amazon.com/products/developer-tools/) 
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+ [AWS CodeBuild](https://aws.amazon.com/codebuild/)
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+ [AWS re:Invent 2022 - AWS Well-Architected best practices for DevOps on AWS](https://youtu.be/hfXokRAyorA)

# OPS05-BP05 Perform patch management
<a name="ops_dev_integ_patch_mgmt"></a>

 Perform patch management to gain features, address issues, and remain compliant with governance. Automate patch management to reduce errors caused by manual processes, scale, and reduce the level of effort to patch. 

 Patch and vulnerability management are part of your benefit and risk management activities. It is preferable to have immutable infrastructures and deploy workloads in verified known good states. Where that is not viable, patching in place is the remaining option. 

 [AWS Health](https://aws.amazon.com/premiumsupport/technology/aws-health/) is the authoritative source of information about planned lifecycle events and other action-required events that affect the health of your AWS Cloud resources. You should be aware of upcoming changes and updates that should be performed. Major planned lifecycle events are sent at least six months in advance. 

 [Amazon EC2 Image Builder](https://aws.amazon.com/image-builder/) provides pipelines to update machine images. As a part of patch management, consider [Amazon Machine Images](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html       ) (AMIs) using an [AMI image pipeline](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-image-pipeline.html) or container images with a [Docker image pipeline](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-container-pipeline.html), while AWS Lambda provides patterns for [custom runtimes and additional libraries](https://docs.aws.amazon.com/lambda/latest/dg/runtimes-custom.html) to remove vulnerabilities. 

 You should manage updates to [Amazon Machine Images](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html) for Linux or Windows Server images using [Amazon EC2 Image Builder](https://aws.amazon.com/image-builder/). You can use [Amazon Elastic Container Registry (Amazon ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) with your existing pipeline to manage Amazon ECS images and manage Amazon EKS images. Lambda includes [version management features](https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html). 

 Patching should not be performed on production systems without first testing in a safe environment. Patches should only be applied if they support an operational or business outcome. On AWS, you can use [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) to automate the process of patching managed systems and schedule the activity using [Systems Manager Maintenance Windows](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-maintenance.html). 

 **Desired outcome:** Your AMI and container images are patched, up-to-date, and ready for launch. You are able to track the status of all deployed images and know patch compliance. You are able to report on current status and have a process to meet your compliance needs. 

 **Common anti-patterns:** 
+  You are given a mandate to apply all new security patches within two hours resulting in multiple outages due to application incompatibility with patches. 
+  An unpatched library results in unintended consequences as unknown parties use vulnerabilities within it to access your workload. 
+  You patch the developer environments automatically without notifying the developers. You receive multiple complaints from the developers that their environment cease to operate as expected. 
+  You have not patched the commercial off-the-shelf software on a persistent instance. When you have an issue with the software and contact the vendor, they notify you that version is not supported and you have to patch to a specific level to receive any assistance. 
+  A recently released patch for the encryption software you used has significant performance improvements. Your unpatched system has performance issues that remain in place as a result of not patching. 
+  You are notified of a zero-day vulnerability requiring an emergency fix and you have to patch all your environments manually. 
+  You are not aware of critical actions needed to maintain your resources, such as mandatory version updates because you do not review upcoming planned lifecycle events and other information. You lose critical time for planning and execution, resulting in emergency changes for your teams and potential impact or unexpected downtime. 

 **Benefits of establishing this best practice:** By establishing a patch management process, including your criteria for patching and methodology for distribution across your environments, you can scale and report on patch levels. This provides assurances around security patching and ensure clear visibility on the status of known fixes being in place. This encourages adoption of desired features and capabilities, the rapid removal of issues, and sustained compliance with governance. Implement patch management systems and automation to reduce the level of effort to deploy patches and limit errors caused by manual processes. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Patch systems to remediate issues, to gain desired features or capabilities, and to remain compliant with governance policy and vendor support requirements. In immutable systems, deploy with the appropriate patch set to achieve the desired result. Automate the patch management mechanism to reduce the elapsed time to patch, to avoid errors caused by manual processes, and lower the level of effort to patch. 

### Implementation steps
<a name="implementation-steps"></a>

 For Amazon EC2 Image Builder: 

1.  Using Amazon EC2 Image Builder, specify pipeline details: 

   1.  Create an image pipeline and name it 

   1.  Define pipeline schedule and time zone 

   1.  Configure any dependencies 

1.  Choose a recipe: 

   1.  Select existing recipe or create a new one 

   1.  Select image type 

   1.  Name and version your recipe 

   1.  Select your base image 

   1.  Add build components and add to target registry 

1.  Optional - define your infrastructure configuration. 

1.  Optional - define configuration settings. 

1.  Review settings. 

1.  Maintain recipe hygiene regularly. 

 For Systems Manager Patch Manager: 

1.  Create a patch baseline. 

1.  Select a patching operations method. 

1.  Enable compliance reporting and scanning. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [ What is Amazon EC2 Image Builder ](https://docs.aws.amazon.com/imagebuilder/latest/userguide/what-is-image-builder.html)
+ [ Create an image pipeline using the Amazon EC2 Image Builder ](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-image-pipeline.html)
+ [ Create a container image pipeline ](https://docs.aws.amazon.com/imagebuilder/latest/userguide/start-build-container-pipeline.html)
+  [AWS Systems Manager Patch Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) 
+ [ Working with Patch Manager ](https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager-console.html)
+ [ Working with patch compliance reports ](https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager-compliance-reports.html)
+ [AWS Developer Tools ](https://aws.amazon.com/products/developer-tools)

 **Related videos:** 
+  [CI/CD for Serverless Applications on AWS](https://www.youtube.com/watch?v=tEpx5VaW4WE) 
+  [Design with Ops in Mind](https://youtu.be/uh19jfW7hw4) 

   **Related examples:** 
+ [AWS Systems Manager Patch Manager tutorials ](https://docs.aws.amazon.com/systems-manager/latest/userguide/patch-manager-tutorials.html)

# OPS05-BP06 Share design standards
<a name="ops_dev_integ_share_design_stds"></a>

 Share best practices across teams to increase awareness and maximize the benefits of development efforts. Document them and keep them up to date as your architecture evolves. If shared standards are enforced in your organization, it’s critical that mechanisms exist to request additions, changes, and exceptions to standards. Without this option, standards become a constraint on innovation. 

 **Desired outcome:** Design standards are shared across teams in your organizations. They are documented and kept up-to-date as best practices evolve. 

 **Common anti-patterns:** 
+ Two development teams have each created a user authentication service. Your users must maintain a separate set of credentials for each part of the system they want to access. 
+ Each team manages their own infrastructure. A new compliance requirement forces a change to your infrastructure and each team implements it in a different way.

 **Benefits of establishing this best practice:** Using shared standards supports the adoption of best practices and maximizes the benefits of development efforts. Documenting and updating design standards keeps your organization up-to-date with best practices and security and compliance requirements. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Share existing best practices, design standards, checklists, operating procedures, guidance, and governance requirements across teams. Have procedures to request changes, additions, and exceptions to design standards to support improvement and innovation. Make teams are aware of published content. Have a mechanism to keep design standards up-to-date as new best practices emerge. 

 **Customer example** 

 AnyCompany Retail has a cross-functional architecture team that creates software architecture patterns. This team builds the architecture with compliance and governance built in. Teams that adopt these shared standards get the benefits of having compliance and governance built in. They can quickly build on top of the design standard. The architecture team meets quarterly to evaluate architecture patterns and update them if necessary. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Identify a cross-functional team that owns developing and updating design standards. This team should work with stakeholders across your organization to develop design standards, operating procedures, checklists, guidance, and governance requirements. Document the design standards and share them within your organization. 

   1.  [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) can be used to create portfolios representing design standards using infrastructure as code. You can share portfolios across accounts. 

1.  Have a mechanism in place to keep design standards up-to-date as new best practices are identified. 

1.  If design standards are centrally enforced, have a process to request changes, updates, and exemptions. 

 **Level of effort for the implementation plan:** Medium. Developing a process to create and share design standards can take coordination and cooperation with stakeholders across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) - Governance requirements influence design standards. 
+  [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) - Compliance is a vital input in creating design standards. 
+  [OPS07-BP02 Ensure a consistent review of operational readiness](ops_ready_to_support_const_orr.md) - Operational readiness checklists are a mechanism to implement design standards when designing your workload. 
+  [OPS11-BP01 Have a process for continuous improvement](ops_evolve_ops_process_cont_imp.md) - Updating design standards is a part of continuous improvement. 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md) - As part of your knowledge management practice, document and share design standards. 

 **Related documents:** 
+ [ Automate AWS Backups with AWS Service Catalog](https://aws.amazon.com/blogs/mt/automate-aws-backups-with-aws-service-catalog/)
+ [AWS Service Catalog Account Factory-Enhanced ](https://aws.amazon.com/blogs/mt/aws-service-catalog-account-factory-enhanced/)
+ [ How Expedia Group built Database as a Service (DBaaS) offering using AWS Service Catalog](https://aws.amazon.com/blogs/mt/how-expedia-group-built-database-as-a-service-dbaas-offering-using-aws-service-catalog/)
+ [ Maintain visibility over the use of cloud architecture patterns ](https://aws.amazon.com/blogs/architecture/maintain-visibility-over-the-use-of-cloud-architecture-patterns/)
+ [ Simplify sharing your AWS Service Catalog portfolios in an AWS Organizations setup ](https://aws.amazon.com/blogs/mt/simplify-sharing-your-aws-service-catalog-portfolios-in-an-aws-organizations-setup/)

 **Related videos:** 
+ [AWS Service Catalog – Getting Started ](https://www.youtube.com/watch?v=A9kKy6WhqVA)
+ [AWS re:Invent 2020: Manage your AWS Service Catalog portfolios like an expert ](https://www.youtube.com/watch?v=lVfXkWHAtR8)

 **Related examples:** 
+ [AWS Service Catalog Reference Architecture ](https://github.com/aws-samples/aws-service-catalog-reference-architectures)
+ [AWS Service Catalog Workshop ](https://catalog.us-east-1.prod.workshops.aws/workshops/d40750d7-a330-49be-9945-cde864610de9/en-US)

 **Related services:** 
+  [AWS Service Catalog](https://docs.aws.amazon.com/servicecatalog/latest/adminguide/introduction.html) 

# OPS05-BP07 Implement practices to improve code quality
<a name="ops_dev_integ_code_quality"></a>

 Implement practices to improve code quality and minimize defects. Some examples include test-driven development, code reviews, standards adoption, and pair programming. Incorporate these practices into your continuous integration and delivery process. 

 **Desired outcome:** Your organization uses best practices like code reviews or pair programming to improve code quality. Developers and operators adopt code quality best practices as part of the software development lifecycle. 

 **Common anti-patterns:** 
+  You commit code to the main branch of your application without a code review. The change automatically deploys to production and causes an outage. 
+  A new application is developed without any unit, end-to-end, or integration tests. There is no way to test the application before deployment. 
+  Your teams make manual changes in production to address defects. Changes do not go through testing or code reviews and are not captured or logged through continuous integration and delivery processes. 

 **Benefits of establishing this best practice:** By adopting practices to improve code quality, you can help minimize issues introduced to production. Code quality best practices include pair programming, code reviews, and implementation of AI productivity tools. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Implement practices to improve code quality to minimize defects before they are deployed. Use practices like test-driven development, code reviews, and pair programming to increase the quality of your development. 

 Use the power of generative AI with Amazon Q Developer to improve developer productivity and code quality. Amazon Q Developer includes generation of code suggestions (based on large language models), production of unit tests (including boundary conditions), and code security enhancements through detection and remediation of security vulnerabilities. 

 **Customer example** 

 AnyCompany Retail adopts several practices to improve code quality. They have adopted test-driven development as the standard for writing applications. For some new features, they will have developers pair program together during a sprint. Every pull request goes through a code review by a senior developer before being integrated and deployed. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Adopt code quality practices like test-driven development, code reviews, and pair programming into your continuous integration and delivery process. Use these techniques to improve software quality. 

   1.  Use [Amazon Q Developer](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/what-is.html), a generative AI tool that can help create unit test cases (including boundary conditions), generate functions using code and comments, implement well-known algorithms, detect security policy violations and vulnerabilities in your code, detect secrets, scan infrastructure as code (IaC), document code, and learn third-party code libraries more quickly. 

   1.  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) can provide programming recommendations for Java and Python code using machine learning. 

 **Level of effort for the implementation plan:** Medium. There are many ways of implementing this best practice, but getting organizational adoption may be challenging. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP02 Test and validate changes](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_test_val_chg.html) 
+  [OPS05-BP06 Share design standards](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_dev_integ_share_design_stds.html) 

 **Related documents:** 
+  [Adopt a test-driven development approach](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html) 
+  [Accelerate your Software Development Lifecycle with Amazon Q](https://aws.amazon.com/blogs/devops/accelerate-your-software-development-lifecycle-with-amazon-q/) 
+  [Amazon Q Developer, now generally available, includes previews of new capabilities to reimagine developer experience](https://aws.amazon.com/blogs/aws/amazon-q-developer-now-generally-available-includes-new-capabilities-to-reimagine-developer-experience/) 
+  [The Ultimate Cheat Sheet for Using Amazon Q Developer in Your IDE](https://community.aws/content/2eYoqeFRqaVnk900emsknDfzhfW/the-ultimate-cheat-sheet-for-using-amazon-q-developer-in-your-ide) 
+  [Shift-Left Workload, leveraging AI for Test Creation](https://community.aws/content/2gBZtC94gPzaCQRnt4P0rIYWuBx/shift-left-workload-leveraging-ai-for-test-creation) 
+  [Amazon Q Developer Center](https://aws.amazon.com/developer/generative-ai/amazon-q/) 
+  [10 ways to build applications faster with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/10-ways-to-build-applications-faster-with-amazon-codewhisperer/) 
+  [Looking beyond code coverage with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/looking-beyond-code-coverage-with-amazon-codewhisperer/) 
+  [Best Practices for Prompt Engineering with Amazon CodeWhisperer](https://aws.amazon.com/blogs/devops/best-practices-for-prompt-engineering-with-amazon-codewhisperer/) 
+  [Agile Software Guide](https://martinfowler.com/agile.html) 
+  [My CI/CD pipeline is my release captain](https://aws.amazon.com/builders-library/cicd-pipeline/) 
+  [Automate code reviews with Amazon CodeGuru Reviewer](https://aws.amazon.com/blogs/devops/automate-code-reviews-with-amazon-codeguru-reviewer/) 
+  [Adopt a test-driven development approach](https://docs.aws.amazon.com/prescriptive-guidance/latest/best-practices-cdk-typescript-iac/development-best-practices.html) 
+  [How DevFactory builds better applications with Amazon CodeGuru](https://aws.amazon.com/blogs/machine-learning/how-devfactory-builds-better-applications-with-amazon-codeguru/) 
+  [On Pair Programming](https://martinfowler.com/articles/on-pair-programming.html) 
+  [RENGA Inc. automates code reviews with Amazon CodeGuru](https://aws.amazon.com/blogs/machine-learning/renga-inc-automates-code-reviews-with-amazon-codeguru/) 
+  [The Art of Agile Development: Test-Driven Development](http://www.jamesshore.com/v2/books/aoad1/test_driven_development) 
+  [Why code reviews matter (and actually save time\$1)](https://www.atlassian.com/agile/software-development/code-reviews) 

 **Related videos:** 
+  [Implement an API with Amazon Q Developer Agent for Software Development](https://www.youtube.com/watch?v=U4XEvJUvff4) 
+  [Installing, Configuring, & Using Amazon Q Developer with JetBrains IDEs (How-to)](https://www.youtube.com/watch?v=-iQfIhTA4J0) 
+  [Mastering the art of Amazon CodeWhisperer - YouTube playlist](https://www.youtube.com/playlist?list=PLDqi6CuDzubxzL-yIqgQb9UbbceYdKhpK) 
+  [AWS re:Invent 2020: Continuous improvement of code quality with Amazon CodeGuru](https://www.youtube.com/watch?v=iX1i35H1OVw) 
+  [AWS Summit ANZ 2021 - Driving a test-first strategy with CDK and test driven development](https://www.youtube.com/watch?v=1R7G_wcyd3s) 

 **Related services:** 
+  [Amazon Q Developer](https://aws.amazon.com/q/developer/) 
+  [Amazon CodeGuru Reviewer](https://docs.aws.amazon.com/codeguru/latest/reviewer-ug/welcome.html) 
+  [Amazon CodeGuru Profiler](https://docs.aws.amazon.com/codeguru/latest/profiler-ug/what-is-codeguru-profiler.html) 

# OPS05-BP08 Use multiple environments
<a name="ops_dev_integ_multi_env"></a>

 Use multiple environments to experiment, develop, and test your workload. Use increasing levels of controls as environments approach production to gain confidence your workload operates as intended when deployed. 

 **Desired outcome:** You have multiple environments that reflect your compliance and governance needs. You test and promote code through environments on your path to production. 

1.  Your organization does this through the establishment of a landing zone, which provides governance, controls, account automations, networking, security, and operational observability. Manage these landing zone capabilities by using multiple environments. A common example is a sandbox organization for developing and testing changes to an [AWS Control Tower](https://aws.amazon.com/controltower/)-based landing zone, which includes [AWS IAM Identity Center](https://aws.amazon.com/iam/identity-center/) and policies such as [service control policies (SCPs)](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html). All of these elements can significantly impact the access to and operation of AWS accounts within the landing zone. 

1.  In addition to these services, your teams extend the landing zones capabilites with solutions published by AWS and AWS partners or as custom solutions developed within your organization. Examples of solutions published by AWS include [Customizations for AWS Control Tower (CfCT)](https://aws.amazon.com/solutions/implementations/customizations-for-aws-control-tower/) and [AWS Control Tower Account Factory for Terraform (AFT)](https://docs.aws.amazon.com/controltower/latest/userguide/aft-overview.html). 

1.  Your organization applies the same principles of testing, promoting code, and policy changes for the landing zone through environments on your path to production. This strategy provides a stable and secure landing zone environment for your application and workload teams. 

 **Common anti-patterns:** 
+  You are performing development in a shared development environment and another developer overwrites your code changes. 
+  The restrictive security controls on your shared development environment are preventing you from experimenting with new services and features. 
+  You perform load testing on your production systems and cause an outage for your users. 
+  A critical error resulting in data loss has occurred in production. In your production environment, you attempt to recreate the conditions that lead to the data loss so that you can identify how it happened and prevent it from happening again. To prevent further data loss during testing, you are forced to make the application unavailable to your users. 
+  You are operating a multi-tenant service and are unable to support a customer request for a dedicated environment. 
+  You may not always test, but when you do, you test in your production environment. 
+  You believe that the simplicity of a single environment overrides the scope of impact of changes within the environment. 
+  You upgrade a key landing zone capability, but the change impairs your team's ability to vend accounts for either new projects or your existing workloads. 
+  You apply new controls to your AWS accounts, but the change impacts your workload team's ability to deploy changes within their AWS accounts. 

 **Benefits of establishing this best practice:** When you deploy multiple environments, you can support multiple simultaneous development, testing, and production environments without creating conflicts between developers or user communities. For complex capabilities such as landing zones, it significantly reduces the risk of changes, simplifies the improvement process, and reduces the risk of critical updates to the environment. Organizations that use landing zones naturally benefit from multi-accounts in their AWS environment, with account structure, governance, network, and security configurations. Over time, as your organization grows, the landing zone can evolve to secure and organize your workloads and resources. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use multiple environments and provide developers sandbox environments with minimized controls to aid in experimentation. Provide individual development environments to help work in parallel, increasing development agility. Implement more rigorous controls in the environments approaching production to allow developers to innovate. Use infrastructure as code and configuration management systems to deploy environments that are configured consistent with the controls present in production to ensure systems operate as expected when deployed. When environments are not in use, turn them off to avoid costs associated with idle resources (for example, development systems on evenings and weekends). Deploy production equivalent environments when load testing to improve valid results. 

 Teams such as platform engineering, networking, and security operations often manage capabilies at the organization level with distinct requirements. A separation of accounts alone is insufficient to provide and maintain separate environments for experimentation, development, and testing. In such cases, create separate instances of AWS Organizations. 

## Resources
<a name="resources"></a>

 **Related documents:** 
+ [ Instance Scheduler on AWS](https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/)
+  [What is AWS CloudFormation?](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) 
+ [ Organizing Your AWS Environment Using Multiple Accounts - Multiple organizations - Test changes to your overall AWS environment ](https://docs.aws.amazon.com/whitepapers/latest/organizing-your-aws-environment/multiple-organizations.html#test-changes-to-your-overall-aws-environment)
+ [AWS Control Tower Guide ](https://catalog.workshops.aws/control-tower)

# OPS05-BP09 Make frequent, small, reversible changes
<a name="ops_dev_integ_freq_sm_rev_chg"></a>

 Frequent, small, and reversible changes reduce the scope and impact of a change. When used in conjunction with change management systems, configuration management systems, and build and delivery systems frequent, small, and reversible changes reduce the scope and impact of a change. This results in more effective troubleshooting and faster remediation with the option to roll back changes. 

 **Common anti-patterns:** 
+  You deploy a new version of your application quarterly with a change window that means a core service is turned off. 
+  You frequently make changes to your database schema without tracking changes in your management systems. 
+  You perform manual in-place updates, overwriting existing installations and configurations, and have no clear roll-back plan. 

 **Benefits of establishing this best practice:** Development efforts are faster by deploying small changes frequently. When the changes are small, it is much easier to identify if they have unintended consequences, and they are easier to reverse. When the changes are reversible, there is less risk to implementing the change, as recovery is simplified. The change process has a reduced risk and the impact of a failed change is reduced. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use frequent, small, and reversible changes to reduce the scope and impact of a change. This eases troubleshooting, helps with faster remediation, and provides the option to roll back a change. It also increases the rate at which you can deliver value to the business. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP03 Use configuration management systems](ops_dev_integ_conf_mgmt_sys.md) 
+  [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md) 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [ Implementing Microservices on AWS](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/microservices-on-aws.html)
+ [ Microservices - Observability ](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/observability.html)

# OPS05-BP10 Fully automate integration and deployment
<a name="ops_dev_integ_auto_integ_deploy"></a>

 Automate build, deployment, and testing of the workload. This reduces errors caused by manual processes and reduces the effort to deploy changes. 

 Apply metadata using [Resource Tags](https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html) and [AWS Resource Groups](https://docs.aws.amazon.com/ARG/latest/APIReference/Welcome.html) following a consistent [tagging strategy](https://aws.amazon.com/answers/account-management/aws-tagging-strategies/) to aid in identification of your resources. Tag your resources for organization, cost accounting, access controls, and targeting the run of automated operations activities. 

 **Desired outcome:** Developers use tools to deliver code and promote through to production. Developers do not have to log into the AWS Management Console to deliver updates. There is a full audit trail of change and configuration, meeting the needs of governance and compliance. Processes are repeatable and are standardized across teams. Developers are free to focus on development and code pushes, increasing productivity. 

 **Common anti-patterns:** 
+  On Friday, you finish authoring the new code for your feature branch. On Monday, after running your code quality test scripts and each of your unit tests scripts, you check in your code for the next scheduled release. 
+  You are assigned to code a fix for a critical issue impacting a large number of customers in production. After testing the fix, you commit your code and email change management to request approval to deploy it to production. 
+  As a developer, you log into the AWS Management Console to create a new development environment using non-standard methods and systems. 

 **Benefits of establishing this best practice:** By implementing automated build and deployment management systems, you reduce errors caused by manual processes and reduce the effort to deploy changes helping your team members to focus on delivering business value. You increase the speed of delivery as you promote through to production. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 You use build and deployment management systems to track and implement change, to reduce errors caused by manual processes, and reduce the level of effort. Fully automate the integration and deployment pipeline from code check-in through build, testing, deployment, and validation. This reduces lead time, encourages increased frequency of change, reduces the level of effort, increases the speed to market, results in increased productivity, and increases the security of your code as you promote through to production. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP03 Use configuration management systems](ops_dev_integ_conf_mgmt_sys.md) 
+  [OPS05-BP04 Use build and deployment management systems](ops_dev_integ_build_mgmt_sys.md) 

 **Related documents:** 
+  [What is AWS CodeBuild?](https://docs.aws.amazon.com/codebuild/latest/userguide/welcome.html) 
+  [What is AWS CodeDeploy?](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) 

 **Related videos:** 
+ [AWS re:Invent 2022 - AWS Well-Architected best practices for DevOps on AWS](https://youtu.be/hfXokRAyorA)

# OPS 6. How do you mitigate deployment risks?
<a name="ops-06"></a>

 Adopt approaches that provide fast feedback on quality and achieve rapid recovery from changes that do not have desired outcomes. Using these practices mitigates the impact of issues introduced through the deployment of changes. 

**Topics**
+ [

# OPS06-BP01 Plan for unsuccessful changes
](ops_mit_deploy_risks_plan_for_unsucessful_changes.md)
+ [

# OPS06-BP02 Test deployments
](ops_mit_deploy_risks_test_val_chg.md)
+ [

# OPS06-BP03 Employ safe deployment strategies
](ops_mit_deploy_risks_deploy_mgmt_sys.md)
+ [

# OPS06-BP04 Automate testing and rollback
](ops_mit_deploy_risks_auto_testing_and_rollback.md)

# OPS06-BP01 Plan for unsuccessful changes
<a name="ops_mit_deploy_risks_plan_for_unsucessful_changes"></a>

Plan to revert to a known good state, or remediate in the production environment if the deployment causes an undesired outcome. Having a policy to establish such a plan helps all teams develop strategies to recover from failed changes. Some example strategies are deployment and rollback steps, change policies, feature flags, traffic isolation, and traffic shifting. A single release may include multiple related component changes. The strategy should provide the ability to withstand or recover from a failure of any component change.

 **Desired outcome:** You have prepared a detailed recovery plan for your change in the event it is unsuccessful. In addition, you have reduced the size of your release to minimize the potential impact on other workload components. As a result, you have reduced your business impact by shortening the potential downtime caused by a failed change and increased the flexibility and efficiency of recovery times. 

 **Common anti-patterns:** 
+  You performed a deployment and your application has become unstable but there appear to be active users on the system. You have to decide whether to rollback the change and impact the active users or wait to rollback the change knowing the users may be impacted regardless. 
+  After making a routine change, your new environments are accessible, but one of your subnets has become unreachable. You have to decide whether to rollback everything or try to fix the inaccessible subnet. While you are making that determination, the subnet remains unreachable. 
+  Your systems are not architected in a way that allows them to be updated with smaller releases. As a result, you have difficulty in reversing those bulk changes during a failed deployment. 
+  You do not use infrastructure as code (IaC) and you made manual updates to your infrastructure that resulted in an undesired configuration. You are unable to effectively track and revert the manual changes. 
+  Because you have not measured increased frequency of your deployments, your team is not incentivized to reduce the size of their changes and improve their rollback plans for each change, leading to more risk and increased failure rates. 
+  You do not measure the total duration of an outage caused by unsuccessful changes. Your team is unable to prioritize and improve its deployment process and recovery plan effectiveness. 

 **Benefits of establishing this best practice:** Having a plan to recover from unsuccessful changes minimizes the mean time to recover (MTTR) and reduces your business impact. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 A consistent, documented policy and practice adopted by release teams allows an organization to plan what should happen if unsuccessful changes occur. The policy should allow for fixing forward in specific circumstances. In either situation, a fix forward or rollback plan should be well documented and tested before deployment to live production so that the time it takes to revert a change is minimized. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Document the policies that require teams to have effective plans to reverse changes within a specified period. 

   1.  Policies should specify when a fix-forward situation is allowed. 

   1.  Require a documented rollback plan to be accessible by all involved. 

   1.  Specify the requirements to rollback (for example, when it is found that unauthorized changes have been deployed). 

1.  Analyze the level of impact of all changes related to each component of a workload. 

   1.  Allow repeatable changes to be standardized, templated, and preauthorized if they follow a consistent workflow that enforces change policies. 

   1.  Reduce the potential impact of any change by making the size of the change smaller so recovery takes less time and causes less business impact. 

   1.  Ensure rollback procedures revert code to the known good state to avoid incidents where possible. 

1.  Integrate tools and workflows to enforce your policies programatically. 

1.  Make data about changes visible to other workload owners to improve the speed of diagnosis of any failed change that cannot be rolled back. 

   1.  Measure success of this practice using visible change data and identify iterative improvements. 

1.  Use monitoring tools to verify the success or failure of a deployment to speed up decision-making on rolling back. 

1.  Measure your duration of outage during an unsuccessful change to continually improve your recovery plans. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP04 Automate testing and rollback](ops_mit_deploy_risks_auto_testing_and_rollback.md) 

 **Related documents:** 
+ [AWS Builders Library \$1 Ensuring Rollback Safety During Deployments ](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/)
+ [AWS Whitepaper \$1 Change Management in the Cloud ](https://docs.aws.amazon.com/whitepapers/latest/change-management-in-the-cloud/change-management-in-the-cloud.html)

 **Related videos:** 
+ [ re:Invent 2019 \$1 Amazon’s approach to high-availability deployment ](https://aws.amazon.com/builders-library/amazon-approach-to-high-availability-deployment/)

# OPS06-BP02 Test deployments
<a name="ops_mit_deploy_risks_test_val_chg"></a>

 Test release procedures in pre-production by using the same deployment configuration, security controls, steps, and procedures as in production. Validate that all deployed steps are completed as expected, such as inspecting files, configurations, and services. Further test all changes with functional, integration, and load tests, along with any monitoring such as health checks. By doing these tests, you can identify deployment issues early with an opportunity to plan and mitigate them prior to production. 

 You can create temporary parallel environments for testing every change. Automate the deployment of the test environments using infrastructure as code (IaC) to help reduce amount of work involved and ensure stability, consistency, and faster feature delivery. 

 **Desired outcome:** Your organization adopts a test-driven development culture that includes testing deployments. This ensures teams are focused on delivering business value rather than managing releases. Teams are engaged early upon identification of deployment risks to determine the appropriate course of mitigation. 

 **Common anti-patterns:** 
+  During production releases, untested deployments cause frequent issues that require troubleshooting and escalation. 
+  Your release contains infrastructure as code (IaC) that updates existing resources. You are unsure if the IaC runs successfully or causes impact to the resources. 
+  You deploy a new feature to your application. It doesn't work as intended and there is no visibility until it gets reported by impacted users. 
+  You update your certificates. You accidentally install the certificates to the wrong components, which goes undetected and impacts website visitors because a secure connection to the website can't be established. 

 **Benefits of establishing this best practice:** Extensive testing in pre-production of deployment procedures, and the changes introduced by them, minimizes the potential impact to production caused by the deployments steps. This increases confidence during production release and minimizes operational support without slowing down velocity of the changes being delivered. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Testing your deployment process is as important as testing the changes that result from your deployment. This can be achieved by testing your deployment steps in a pre-production environment that mirrors production as closely as possible. Common issues, such as incomplete or incorrect deployment steps, or misconfigurations, can be caught as a result before going to production. In addition, you can test your recovery steps. 

 **Customer example** 

 As part of their continuous integration and continuous delivery (CI/CD) pipeline, AnyCompany Retail performs the defined steps needed to release infrastructure and software updates for its customers in a production-like environment. The pipeline is comprised of pre-checks to detect drift (detecting changes to resources performed outside of your IaC) in resources prior to deployment, as well as validate actions that the IaC takes upon its initiation. It validates deployment steps, like verifying that certain files and configurations are in place and services are in running states and are responding correctly to health checks on local host before re-registering with the load balancer. Additionally, all changes flag a number of automated tests, such as functional, security, regression, integration, and load tests. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Perform pre-install checks to mirror the pre-production environment to production. 

   1.  Use [drift detection](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html) to detect when resources have been changed outside of CloudFormation. 

   1.  Use [change sets](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-updating-stacks-changesets.html) to validate that the intent of a stack update matches the actions that CloudFormation takes when the change set is initiated. 

1.  This triggers a manual approval step in [AWS CodePipeline](https://docs.aws.amazon.com/codepipeline/latest/userguide/approvals.html) to authorize the deployment to the pre-production environment. 

1.  Use deployment configurations such as [AWS CodeDeploy AppSpec](https://docs.aws.amazon.com/codedeploy/latest/userguide/application-specification-files.html) files to define deployment and validation steps. 

1.  Where applicable, [integrate AWS CodeDeploy with other AWS services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-aws.html) or [integrate AWS CodeDeploy with partner product and services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-partners.html). 

1.  [Monitor deployments](https://docs.aws.amazon.com/codedeploy/latest/userguide/monitoring.html) using Amazon CloudWatch, AWS CloudTrail, and Amazon SNS event notifications. 

1.  Perform post-deployment automated testing, including functional, security, regression, integration, and load testing. 

1.  [Troubleshoot](https://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html) deployment issues. 

1.  Successful validation of preceding steps should initiate a manual approval workflow to authorize deployment to production. 

 **Level of effort for the implementation plan:** High 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP02 Test and validate changes](ops_dev_integ_test_val_chg.md) 

 **Related documents:** 
+ [AWS Builders' Library \$1 Automating safe, hands-off deployments \$1 Test Deployments ](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/#Test_deployments_in_pre-production_environments)
+ [AWS Whitepaper \$1 Practicing Continuous Integration and Continuous Delivery on AWS](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/testing-stages-in-continuous-integration-and-continuous-delivery.html)
+ [ The Story of Apollo - Amazon's Deployment Engine ](https://www.allthingsdistributed.com/2014/11/apollo-amazon-deployment-engine.html)
+  [How to test and debug AWS CodeDeploy locally before you ship your code](https://aws.amazon.com/blogs/devops/how-to-test-and-debug-aws-codedeploy-locally-before-you-ship-your-code/) 
+ [ Integrating Network Connectivity Testing with Infrastructure Deployment ](https://aws.amazon.com/blogs/networking-and-content-delivery/integrating-network-connectivity-testing-with-infrastructure-deployment/)

 **Related videos:** 
+ [ re:Invent 2020 \$1 Testing software and systems at Amazon ](https://www.youtube.com/watch?v=o1sc3cK9bMU)

 **Related examples:** 
+ [ Tutorial \$1 Deploy and Amazon ECS service with a validation test ](https://docs.aws.amazon.com/codedeploy/latest/userguide/tutorial-ecs-deployment-with-hooks.html)

# OPS06-BP03 Employ safe deployment strategies
<a name="ops_mit_deploy_risks_deploy_mgmt_sys"></a>

 Safe production roll-outs control the flow of beneficial changes with an aim to minimize any perceived impact for customers from those changes. The safety controls provide inspection mechanisms to validate desired outcomes and limit the scope of impact from any defects introduced by the changes or from deployment failures. Safe roll-outs may include strategies such as feature-flags, one-box, rolling (canary releases), immutable, traffic splitting, and blue/green deployments. 

 **Desired outcome:** Your organization uses a continuous integration continuous delivery (CI/CD) system that provides capabilities for automating safe rollouts. Teams are required to use appropriate safe roll-out strategies. 

 **Common anti-patterns:** 
+  You deploy an unsuccessful change to all of production all at once. As a result, all customers are impacted simultaneously. 
+  A defect introduced in a simultaneous deployment to all systems requires an emergency release. Correcting it for all customers takes several days. 
+  Managing production release requires planning and participation of several teams. This puts constraints on your ability to frequently update features for your customers. 
+  You perform a mutable deployment by modifying your existing systems. After discovering that the change was unsuccessful, you are forced to modify the systems again to restore the old version, extending your time to recovery. 

 **Benefits of establishing this best practice:** Automated deployments balance speed of roll-outs against delivering beneficial changes consistently to customers. Limiting impact prevents costly deployment failures and maximizes teams ability to efficiently respond to failures. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Continuous-delivery failures can lead to reduced service availability and bad customer experiences. To maximize the rate of successful deployments, implement safety controls in the end-to-end release process to minimize deployment errors, with a goal of achieving zero deployment failures. 

 **Customer example** 

 AnyCompany Retail is on a mission to achieve minimal to zero downtime deployments, meaning that there's no perceivable impact to its users during deployments. To accomplish this, the company has established deployment patterns (see the following workflow diagram), such as rolling and blue/green deployments. All teams adopt one or more of these patterns in their CI/CD pipeline. 


| CodeDeploy workflow for Amazon EC2 | CodeDeploy workflow for Amazon ECS | CodeDeploy workflow for Lambda | 
| --- | --- | --- | 
|  ![\[Deployment process flow for Amazon EC2\]](http://docs.aws.amazon.com/wellarchitected/latest/framework/images/deployment-process-ec2.png)  |  ![\[Deployment process flow for Amazon ECS\]](http://docs.aws.amazon.com/wellarchitected/latest/framework/images/deployment-process-ecs.png)  |  ![\[Deployment process flow for Lambda\]](http://docs.aws.amazon.com/wellarchitected/latest/framework/images/deployment-process-lambda.png)  | 

### Implementation steps
<a name="implementation-steps"></a>

1.  Use an approval workflow to initiate the sequence of production roll-out steps upon promotion to production . 

1.  Use an automated deployment system such as [AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html). AWS CodeDeploy [deployment options](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps.html) include in-place deployments for EC2/On-Premises and blue/green deployments for EC2/On-Premises, AWS Lambda, and Amazon ECS (see the preceding workflow diagram). 

   1.  Where applicable, [integrate AWS CodeDeploy with other AWS services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-aws.html) or [integrate AWS CodeDeploy with partner product and services](https://docs.aws.amazon.com/codedeploy/latest/userguide/integrations-partners.html). 

1.  Use blue/green deployments for databases such as [Amazon Aurora](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/blue-green-deployments.html) and [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-green-deployments.html). 

1.  [Monitor deployments](https://docs.aws.amazon.com/codedeploy/latest/userguide/monitoring.html) using Amazon CloudWatch, AWS CloudTrail, and Amazon Simple Notification Service (Amazon SNS) event notifications. 

1.  Perform post-deployment automated testing including functional, security, regression, integration, and any load tests. 

1.  [Troubleshoot](https://docs.aws.amazon.com/codedeploy/latest/userguide/troubleshooting.html) deployment issues. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS05-BP02 Test and validate changes](ops_dev_integ_test_val_chg.md) 
+  [OPS05-BP09 Make frequent, small, reversible changes](ops_dev_integ_freq_sm_rev_chg.md) 
+  [OPS05-BP10 Fully automate integration and deployment](ops_dev_integ_auto_integ_deploy.md) 

 **Related documents:** 
+ [AWS Builders Library \$1 Automating safe, hands-off deployments \$1 Production deployments ](https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/?did=ba_card&trk=ba_card#Production_deployments)
+ [AWS Builders Library \$1 My CI/CD pipeline is my release captain \$1 Safe, automatic production releases](https://aws.amazon.com//builders-library/cicd-pipeline/#Safe.2C_automatic_production_releases)
+ [AWS Whitepaper \$1 Practicing Continuous Integration and Continuous Delivery on AWS \$1 Deployment methods](https://docs.aws.amazon.com/whitepapers/latest/practicing-continuous-integration-continuous-delivery/deployment-methods.html)
+ [AWS CodeDeploy User Guide](https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html)
+ [Working with deployment configurations in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-configurations.html)
+ [Set up an API Gateway canary release deployment ](https://docs.aws.amazon.com/apigateway/latest/developerguide/canary-release.html)
+ [Amazon ECS Deployment Types](https://docs.aws.amazon.com/)
+ [Fully Managed Blue/Green Deployments in Amazon Aurora and Amazon RDS](https://aws.amazon.com/blogs/aws/new-fully-managed-blue-green-deployments-in-amazon-aurora-and-amazon-rds/)
+ [Blue/Green deployments with AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html)

 **Related videos:** 
+ [re:Invent 2020 \$1 Hands-off: Automating continuous delivery pipelines at Amazon](https://www.youtube.com/watch?v=ngnMj1zbMPY)
+ [re:Invent 2019 \$1 Amazon's Approach to high-availability deployment](https://www.youtube.com/watch?v=bCgD2bX1LI4)

 **Related examples:** 
+ [Try a Sample Blue/Green Deployment in AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/applications-create-blue-green.html)
+ [ Workshop \$1 Building CI/CD pipelines for Lambda canary deployments using AWS CDK](https://catalog.workshops.aws/cdk-cicd-for-lambda-canary-deployment/en-US) 
+ [ Workshop \$1 Building your first DevOps Blue/Green pipeline with Amazon ECS ](https://catalog.us-east-1.prod.workshops.aws/workshops/4b59b9fb-48b6-461c-9377-907b2e33c9df/en-US)
+ [ Workshop \$1 Building your first DevOps Blue/Green pipeline with Amazon EKS ](https://catalog.us-east-1.prod.workshops.aws/workshops/4eab6682-09b2-43e5-93d4-1f58fd6cff6e/en-US)
+ [ Workshop \$1 EKS GitOps with ArgoCD ](https://catalog.workshops.aws/eksgitops-argocd-githubactions)
+ [ Workshop \$1 CI/CD on AWS Workshop ](https://catalog.workshops.aws/cicdonaws/en-US)
+ [ Implementing cross-account CI/CD with AWS SAM for container-based Lambda functions ](https://aws.amazon.com/blogs/compute/implementing-cross-account-cicd-with-aws-sam-for-container-based-lambda/)

# OPS06-BP04 Automate testing and rollback
<a name="ops_mit_deploy_risks_auto_testing_and_rollback"></a>

 To increase the speed, reliability, and confidence of your deployment process, have a strategy for automated testing and rollback capabilities in pre-production and production environments. Automate testing when deploying to production to simulate human and system interactions that verify the changes being deployed. Automate rollback to revert back to a previous known good state quickly. The rollback should be initiated automatically on pre-defined conditions such as when the desired outcome of your change is not achieved or when the automated test fails. Automating these two activities improves your success rate for your deployments, minimizes recovery time, and reduces the potential impact to the business. 

 **Desired outcome:** Your automated tests and rollback strategies are integrated into your continuous integration, continuous delivery (CI/CD) pipeline. Your monitoring is able to validate against your success criteria and initiate automatic rollback upon failure. This minimizes any impact to end users and customers. For example, when all testing outcomes have been satisfied, you promote your code into the production environment where automated regression testing is initiated, leveraging the same test cases. If regression test results do not match expectations, then automated rollback is initiated in the pipeline workflow. 

 **Common anti-patterns:** 
+  Your systems are not architected in a way that allows them to be updated with smaller releases. As a result, you have difficulty in reversing those bulk changes during a failed deployment. 
+  Your deployment process consists of a series of manual steps. After you deploy changes to your workload, you start post-deployment testing. After testing, you realize that your workload is inoperable and customers are disconnected. You then begin rolling back to the previous version. All of these manual steps delay overall system recovery and cause a prolonged impact to your customers. 
+  You spent time developing automated test cases for functionality that is not frequently used in your application, minimizing the return on investment in your automated testing capability. 
+  Your release is comprised of application, infrastructure, patches and configuration updates that are independent from one another. However, you have a single CI/CD pipeline that delivers all changes at once. A failure in one component forces you to revert all changes, making your rollback complex and inefficient. 
+  Your team completes the coding work in sprint one and begins sprint two work, but your plan did not include testing until sprint three. As a result, automated tests revealed defects from sprint one that had to be resolved before testing of sprint two deliverables could be started and the entire release is delayed, devaluing your automated testing. 
+  Your automated regression test cases for the production release are complete, but you are not monitoring workload health. Since you have no visibility into whether or not the service has restarted, you are not sure if rollback is needed or if it has already occurred. 

 **Benefits of establishing this best practice:** Automated testing increases the transparency of your testing process and your ability to cover more features in a shorter time period. By testing and validating changes in production, you are able to identify issues immediately. Improvement in consistency with automated testing tools allows for better detection of defects. By automatically rolling back to the previous version, the impact on your customers is minimized. Automated rollback ultimately inspires more confidence in your deployment capabilities by reducing business impact. Overall, these capabilities reduce time-to-delivery while ensuring quality. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Automate testing of deployed environments to confirm desired outcomes more quickly. Automate rollback to a previous known good state when pre-defined outcomes are not achieved to minimize recovery time and reduce errors caused by manual processes. Integrate testing tools with your pipeline workflow to consistently test and minimize manual inputs. Prioritize automating test cases, such as those that mitigate the greatest risks and need to be tested frequently with every change. Additionally, automate rollback based on specific conditions that are pre-defined in your test plan. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Establish a testing lifecycle for your development lifecycle that defines each stage of the testing process from requirements planning to test case development, tool configuration, automated testing, and test case closure. 

   1.  Create a workload-specific testing approach from your overall test strategy. 

   1.  Consider a continuous testing strategy where appropriate throughout the development lifecycle. 

1.  Select automated tools for testing and rollback based on your business requirements and pipeline investments. 

1.  Decide which test cases you wish to automate and which should be performed manually. These can be defined based on business value priority of the feature being tested. Align all team members to this plan and verify accountability for performing manual tests. 

   1.  Apply automated testing capabilities to specific test cases that make sense for automation, such as repeatable or frequently run cases, those that require repetitive tasks, or those that are required across multiple configurations. 

   1.  Define test automation scripts as well as the success criteria in the automation tool so continued workflow automation can be initiated when specific cases fail. 

   1.  Define specific failure criteria for automated rollback. 

1.  Prioritize test automation to drive consistent results with thorough test case development where complexity and human interaction have a higher risk of failure. 

1.  Integrate your automated testing and rollback tools into your CI/CD pipeline. 

   1.  Develop clear success criteria for your changes. 

   1.  Monitor and observe to detect these criteria and automatically reverse changes when specific rollback criteria are met. 

1.  Perform different types of automated production testing, such as: 

   1.  A/B testing to show results in comparison to the current version between two user testing groups. 

   1.  Canary testing that allows you to roll out your change to a subset of users before releasing it to all. 

   1.  Feature-flag testing which allows a single feature of the new version at a time to be flagged on and off from outside the application so that each new feature can be validated one at a time. 

   1.  Regression testing to verify new functionality with existing interrelated components. 

1.  Monitor the operational aspects of the application, transactions, and interactions with other applications and components. Develop reports to show success of changes by workload so that you can identify what parts of the automation and workflow can be further optimized. 

   1.  Develop test result reports that help you make quick decisions on whether or not rollback procedures should be invoked. 

   1.  Implement a strategy that allows for automated rollback based upon pre-defined failure conditions that result from one or more of your test methods. 

1.  Develop your automated test cases to allow for reusability across future repeatable changes. 

 **Level of effort for the implementation plan:** Medium 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) 
+  [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md) 

 **Related documents:** 
+ [AWS Builders Library \$1 Ensuring rollback safety during deployments ](https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/)
+  [Redeploy and rollback a deployment with AWS CodeDeploy](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployments-rollback-and-redeploy.html) 
+ [ 8 best practices when automating your deployments with AWS CloudFormation](https://aws.amazon.com/blogs/infrastructure-and-automation/best-practices-automating-deployments-with-aws-cloudformation/)

 **Related examples:** 
+ [ Serverless UI testing using Selenium, AWS Lambda, AWS Fargate, and AWS Developer Tools ](https://aws.amazon.com/blogs/devops/using-aws-codepipeline-aws-codebuild-and-aws-lambda-for-serverless-automated-ui-testing/)

 **Related videos:** 
+ [ re:Invent 2020 \$1 Hands-off: Automating continuous delivery pipelines at Amazon ](https://www.youtube.com/watch?v=ngnMj1zbMPY)
+ [ re:Invent 2019 \$1 Amazon's Approach to high-availability deployment ](https://www.youtube.com/watch?v=bCgD2bX1LI4)

# OPS 7. How do you know that you are ready to support a workload?
<a name="ops-07"></a>

 Evaluate the operational readiness of your workload, processes and procedures, and personnel to understand the operational risks related to your workload. 

**Topics**
+ [

# OPS07-BP01 Ensure personnel capability
](ops_ready_to_support_personnel_capability.md)
+ [

# OPS07-BP02 Ensure a consistent review of operational readiness
](ops_ready_to_support_const_orr.md)
+ [

# OPS07-BP03 Use runbooks to perform procedures
](ops_ready_to_support_use_runbooks.md)
+ [

# OPS07-BP04 Use playbooks to investigate issues
](ops_ready_to_support_use_playbooks.md)
+ [

# OPS07-BP05 Make informed decisions to deploy systems and changes
](ops_ready_to_support_informed_deploy_decisions.md)
+ [

# OPS07-BP06 Create support plans for production workloads
](ops_ready_to_support_enable_support_plans.md)

# OPS07-BP01 Ensure personnel capability
<a name="ops_ready_to_support_personnel_capability"></a>

Have a mechanism to validate that you have the appropriate number of trained personnel to support the workload. They must be trained on the platform and services that make up your workload. Provide them with the knowledge necessary to operate the workload. You must have enough trained personnel to support the normal operation of the workload and troubleshoot any incidents that occur. Have enough personnel so that you can rotate during on-call and vacations to avoid burnout. 

 **Desired outcome:** 
+  There are enough trained personnel to support the workload at times when the workload is available. 
+  You provide training for your personnel on the software and services that make up your workload. 

 **Common anti-patterns:** 
+ Deploying a workload without team members trained to operate the platform and services in use. 
+  Not having enough personnel to support on-call rotations or personnel taking time off. 

 **Benefits of establishing this best practice:** 
+  Having skilled team members helps effective support of your workload. 
+  With enough team members, you can support the workload and on-call rotations while decreasing the risk of burnout. 

 **Level of risk exposed if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 Validate that there are sufficient trained personnel to support the workload. Verify that you have enough team members to cover normal operational activities, including on-call rotations. 

 **Customer example** 

 AnyCompany Retail makes sure that teams supporting the workload are properly staffed and trained. They have enough engineers to support an on-call rotation. Personnel get training on the software and platform that the workload is built on and are encouraged to earn certifications. There are enough personnel so that people can take time off while still supporting the workload and the on-call rotation. 

### Implementation steps
<a name="implementation-steps"></a>

1.  Assign an adequate number of personnel to operate and support your workload, including on-call duties, security issues, and lifecycle events, such as end of support and certificate rotation tasks. 

1.  Train your personnel on the software and platforms that compose your workload. 

   1.  [AWS Training and Certification](https://aws.amazon.com/training/) has a library of courses about AWS. They provide free and paid courses, online and in-person. 

   1.  [AWS hosts events and webinars](https://aws.amazon.com/events/) where you learn from AWS experts. 

1. Perform the following on a regular basis: 
   +  Evaluate team size and skills as operating conditions and the workload change. 
   +  Adjust team size and skills to match operational requirements. 
   +  Verify ability and capacity to [address planned lifecycle events](https://docs.aws.amazon.com/health/latest/ug/aws-health-planned-lifecycle-events.html), unplanned security, and operational notifications through AWS Health. 

 **Level of effort for the implementation plan:** High. Hiring and training a team to support a workload can take significant effort but has substantial long-term benefits. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS11-BP04 Perform knowledge management](ops_evolve_ops_knowledge_management.md) - Team members must have the information necessary to operate and support the workload. Knowledge management is the key to providing that. 

 **Related documents:** 
+  [AWS Events and Webinars](https://aws.amazon.com/events/) 
+  [AWS Training and Certification](https://aws.amazon.com/training/) 

# OPS07-BP02 Ensure a consistent review of operational readiness
<a name="ops_ready_to_support_const_orr"></a>

Use Operational Readiness Reviews (ORRs) to validate that you can operate your workload. ORR is a mechanism developed at Amazon to validate that teams can safely operate their workloads. An ORR is a review and inspection process using a checklist of requirements. An ORR is a self-service experience that teams use to certify their workloads. ORRs include best practices from lessons learned from our years of building software. 

 An ORR checklist is composed of architectural recommendations, operational process, event management, and release quality. Our Correction of Error (CoE) process is a major driver of these items. Your own post-incident analysis should drive the evolution of your own ORR. An ORR is not only about following best practices but preventing the recurrence of events that you’ve seen before. Lastly, security, governance, and compliance requirements can also be included in an ORR. 

 Run ORRs before a workload launches to general availability and then throughout the software development lifecycle. Running the ORR before launch increases your ability to operate the workload safely. Periodically re-run your ORR on the workload to catch any drift from best practices. You can have ORR checklists for new services launches and ORRs for periodic reviews. This helps keep you up to date on new best practices that arise and incorporate lessons learned from post-incident analysis. As your use of the cloud matures, you can build ORR requirements into your architecture as defaults. 

 **Desired outcome:**  You have an ORR checklist with best practices for your organization. ORRs are conducted before workloads launch. ORRs are run periodically over the course of the workload lifecycle. 

 **Common anti-patterns:** 
+ You launch a workload without knowing if you can operate it. 
+ Governance and security requirements are not included in certifying a workload for launch. 
+ Workloads are not re-evaluated periodically. 
+ Workloads launch without required procedures in place. 
+ You see repetition of the same root cause failures in multiple workloads. 

 **Benefits of establishing this best practice:** 
+  Your workloads include architecture, process, and management best practices. 
+  Lessons learned are incorporated into your ORR process. 
+  Required procedures are in place when workloads launch. 
+  ORRs are run throughout the software lifecycle of your workloads. 

 **Level of risk if this best practice is not established:** High 

## Implementation guidance
<a name="implementation-guidance"></a>

 An ORR is two things: a process and a checklist. Your ORR process should be adopted by your organization and supported by an executive sponsor. At a minimum, ORRs must be conducted before a workload launches to general availability. Run the ORR throughout the software development lifecycle to keep it up to date with best practices or new requirements. The ORR checklist should include configuration items, security and governance requirements, and best practices from your organization. Over time, you can use services, such as [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html), [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html), and [AWS Control Tower Guardrails](https://docs.aws.amazon.com/controltower/latest/userguide/guardrails.html), to build best practices from the ORR into guardrails for automatic detection of best practices. 

 **Customer example** 

 After several production incidents, AnyCompany Retail decided to implement an ORR process. They built a checklist composed of best practices, governance and compliance requirements, and lessons learned from outages. New workloads conduct ORRs before they launch. Every workload conducts a yearly ORR with a subset of best practices to incorporate new best practices and requirements that are added to the ORR checklist. Over time, AnyCompany Retail used [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) to detect some best practices, speeding up the ORR process. 

 **Implementation steps** 

 To learn more about ORRs, read the [Operational Readiness Reviews (ORR) whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html). It provides detailed information on the history of the ORR process, how to build your own ORR practice, and how to develop your ORR checklist. The following steps are an abbreviated version of that document. For an in-depth understanding of what ORRs are and how to build your own, we recommend reading that whitepaper. 

1. Gather the key stakeholders together, including representatives from security, operations, and development. 

1. Have each stakeholder provide at least one requirement. For the first iteration, try to limit the number of items to thirty or less. 
   +  [Appendix B: Example ORR questions](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/appendix-b-example-orr-questions.html) from the Operational Readiness Reviews (ORR) whitepaper contains sample questions that you can use to get started. 

1. Collect your requirements into a spreadsheet. 
   + You can use [custom lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html) in the [AWS Well-Architected Tool](https://console.aws.amazon.com/wellarchiected/) to develop your ORR and share them across your accounts and AWS Organization. 

1. Identify one workload to conduct the ORR on. A pre-launch workload or an internal workload is ideal. 

1. Run through the ORR checklist and take note of any discoveries made. Discoveries might be acceptable if a mitigation is in place. For any discovery that lacks a mitigation, add those to your backlog of items and implement them before launch. 

1. Continue to add best practices and requirements to your ORR checklist over time. 

 Support customers with Enterprise Support can request the [Operational Readiness Review Workshop](https://aws.amazon.com/premiumsupport/technology-and-programs/proactive-services/) from their Technical Account Manager. The workshop is an interactive *working backwards* session to develop your own ORR checklist. 

 **Level of effort for the implementation plan:** High. Adopting an ORR practice in your organization requires executive sponsorship and stakeholder buy-in. Build and update the checklist with inputs from across your organization. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+ [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) – Governance requirements are a natural fit for an ORR checklist. 
+ [OPS01-BP04 Evaluate compliance requirements](ops_priorities_compliance_reqs.md) – Compliance requirements are sometimes included in an ORR checklist. Other times they are a separate process. 
+ [OPS03-BP07 Resource teams appropriately](ops_org_culture_team_res_appro.md) – Team capability is a good candidate for an ORR requirement. 
+ [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) – A rollback or rollforward plan must be established before you launch your workload. 
+ [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md) – To support a workload you must have the required personnel. 
+ [SEC01-BP03 Identify and validate control objectives](https://docs.aws.amazon.com/wellarchitected/latest/framework/sec_securely_operate_control_objectives.html) – Security control objectives make excellent ORR requirements. 
+ [REL13-BP01 Define recovery objectives for downtime and data loss](https://docs.aws.amazon.com/wellarchitected/latest/framework/rel_planning_for_recovery_objective_defined_recovery.html) – Disaster recovery plans are a good ORR requirement. 
+ [COST02-BP01 Develop policies based on your organization requirements](https://docs.aws.amazon.com/wellarchitected/latest/framework/cost_govern_usage_policies.html) – Cost management policies are good to include in your ORR checklist. 

 **Related documents:** 
+  [AWS Control Tower - Guardrails in AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/guardrails.html) 
+  [AWS Well-Architected Tool - Custom Lenses](https://docs.aws.amazon.com/wellarchitected/latest/userguide/lenses-custom.html) 
+  [Operational Readiness Review Template by Adrian Hornsby](https://medium.com/the-cloud-architect/operational-readiness-review-template-e23a4bfd8d79) 
+  [Operational Readiness Reviews (ORR) Whitepaper](https://docs.aws.amazon.com/wellarchitected/latest/operational-readiness-reviews/wa-operational-readiness-reviews.html) 

 **Related videos:** 
+  [AWS Supports You \$1 Building an Effective Operational Readiness Review (ORR)](https://www.youtube.com/watch?v=Keo6zWMQqS8) 

 **Related examples:** 
+  [Sample Operational Readiness Review (ORR) Lens](https://github.com/aws-samples/custom-lens-wa-sample/tree/main/ORR-Lens) 

 **Related services:** 
+  [AWS Config](https://docs.aws.amazon.com/config/latest/developerguide/WhatIsConfig.html) 
+  [AWS Control Tower](https://docs.aws.amazon.com/controltower/latest/userguide/what-is-control-tower.html) 
+  [AWS Security Hub CSPM](https://docs.aws.amazon.com/securityhub/latest/userguide/what-is-securityhub.html) 
+  [AWS Well-Architected Tool](https://docs.aws.amazon.com/wellarchitected/latest/userguide/intro.html) 

# OPS07-BP03 Use runbooks to perform procedures
<a name="ops_ready_to_support_use_runbooks"></a>

 A *runbook* is a documented process to achieve a specific outcome. Runbooks consist of a series of steps that someone follows to get something done. Runbooks have been used in operations going back to the early days of aviation. In cloud operations, we use runbooks to reduce risk and achieve desired outcomes. At its simplest, a runbook is a checklist to complete a task. 

 Runbooks are an essential part of operating your workload. From onboarding a new team member to deploying a major release, runbooks are the codified processes that provide consistent outcomes no matter who uses them. Runbooks should be published in a central location and updated as the process evolves, as updating runbooks is a key component of a change management process. They should also include guidance on error handling, tools, permissions, exceptions, and escalations in case a problem occurs. 

 As your organization matures, begin automating runbooks. Start with runbooks that are short and frequently used. Use scripting languages to automate steps or make steps easier to perform. As you automate the first few runbooks, you'll dedicate time to automating more complex runbooks. Over time, most of your runbooks should be automated in some way. 

 **Desired outcome:** Your team has a collection of step-by-step guides for performing workload tasks. The runbooks contain the desired outcome, necessary tools and permissions, and instructions for error handling. They are stored in a central location (version control system) and updated frequently. For example, your runbooks provide capabilities for your teams to monitor, communicate, and respond to AWS Health events for critical accounts during application alarms, operational issues, and planned lifecycle events. 

 **Common anti-patterns:** 
+  Relying on memory to complete each step of a process. 
+  Manually deploying changes without a checklist. 
+  Different team members performing the same process but with different steps or outcomes. 
+  Letting runbooks drift out of sync with system changes and automation. 

 **Benefits of establishing this best practice:** 
+  Reducing error rates for manual tasks. 
+  Operations are performed in a consistent manner. 
+  New team members can start performing tasks sooner. 
+  Runbooks can be automated to reduce toil. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 Runbooks can take several forms depending on the maturity level of your organization. At a minimum, they should consist of a step-by-step text document. The desired outcome should be clearly indicated. Clearly document necessary special permissions or tools. Provide detailed guidance on error handling and escalations in case something goes wrong. List the runbook owner and publish it in a central location. Once your runbook is documented, validate it by having someone else on your team run it. As procedures evolve, update your runbooks in accordance with your change management process. 

 Your text runbooks should be automated as your organization matures. Using services like [AWS Systems Manager automations](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html), you can transform flat text into automations that can be run against your workload. These automations can be run in response to events, reducing the operational burden to maintain your workload. AWS Systems Manager Automation also provides a low-code [visual design experience](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-visual-designer.html) to create automation runbooks more easily. 

 **Customer example** 

 AnyCompany Retail must perform database schema updates during software deployments. The Cloud Operations Team worked with the Database Administration Team to build a runbook for manually deploying these changes. The runbook listed each step in the process in checklist form. It included a section on error handling in case something went wrong. They published the runbook on their internal wiki along with their other runbooks. The Cloud Operations Team plans to automate the runbook in a future sprint. 

### Implementation steps
<a name="implementation-steps"></a>

 If you don't have an existing document repository, a version control repository is a great place to start building your runbook library. You can build your runbooks using Markdown. We have provided an example runbook template that you can use to start building runbooks. 

```
# Runbook Title
## Runbook Info
| Runbook ID | Description | Tools Used | Special Permissions | Runbook Author | Last Updated | Escalation POC | 
|-------|-------|-------|-------|-------|-------|-------|
| RUN001 | What is this runbook for? What is the desired outcome? | Tools | Permissions | Your Name | 2022-09-21 | Escalation Name |
## Steps
1. Step one
2. Step two
```

1.  If you don't have an existing documentation repository or wiki, create a new version control repository in your version control system. 

1.  Identify a process that does not have a runbook. An ideal process is one that is conducted semiregularly, short in number of steps, and has low impact failures. 

1.  In your document repository, create a new draft Markdown document using the template. Fill in Runbook Title and the required fields under Runbook Info. 

1.  Starting with the first step, fill in the Steps portion of the runbook. 

1.  Give the runbook to a team member. Have them use the runbook to validate the steps. If something is missing or needs clarity, update the runbook. 

1.  Publish the runbook to your internal documentation store. Once published, tell your team and other stakeholders. 

1.  Over time, you'll build a library of runbooks. As that library grows, start working to automate runbooks. 

 **Level of effort for the implementation plan:** Low. The minimum standard for a runbook is a step-by-step text guide. Automating runbooks can increase the implementation effort. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_ops_model_def_proc_owners.html) 
+  [OPS07-BP04 Use playbooks to investigate issues](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_ready_to_support_use_playbooks.html) 
+  [OPS10-BP01 Use a process for event, incident, and problem management](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_event_response_event_incident_problem_process.html) 
+  [OPS10-BP02 Have a process per alert](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_event_response_process_per_alert.html) 
+  [OPS11-BP04 Perform knowledge management](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_evolve_ops_knowledge_management.html) 

 **Related documents:** 
+  [Achieving Operational Excellence using automated playbook and runbook](https://aws.amazon.com/blogs/mt/achieving-operational-excellence-using-automated-playbook-and-runbook/) 
+  [AWS Systems Manager: Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+  [Migration playbook for AWS large migrations - Task 4: Improving your migration runbooks](https://docs.aws.amazon.com/prescriptive-guidance/latest/large-migration-migration-playbook/task-four-migration-runbooks.html) 
+  [Use AWS Systems Manager Automation runbooks to resolve operational tasks](https://aws.amazon.com/blogs/mt/use-aws-systems-manager-automation-runbooks-to-resolve-operational-tasks/) 

 **Related videos:** 
+  [AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response](https://www.youtube.com/watch?v=E1NaYN_fJUo) 
+  [How to automate IT Operations on AWS \$1 Amazon Web Services](https://www.youtube.com/watch?v=GuWj_mlyTug) 
+  [Integrate Scripts into AWS Systems Manager](https://www.youtube.com/watch?v=Seh1RbnF-uE) 

 **Related examples:** 
+  [Well-Architected Labs: Automating operations with Playbooks and Runbooks](https://wellarchitectedlabs.com/operational-excellence/200_labs/200_automating_operations_with_playbooks_and_runbooks/) 
+  [AWS Blog Post: Build a Cloud Automation Practice for Operational Excellence: Best Practices from AWS Managed Services](https://aws.amazon.com/blogs/mt/build-a-cloud-automation-practice-for-operational-excellence-best-practices-from-aws-managed-services/) 
+  [AWS Systems Manager: Automation walkthroughs](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk.html) 
+  [AWS Systems Manager: Restore a root volume from the latest snapshot runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-document-sample-restore.html) 
+  [Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake](https://catalog.us-east-1.prod.workshops.aws/workshops/a5801f0c-7bd6-4282-91ae-4dfeb926a035/en-US) 
+  [Gitlab - Runbooks](https://gitlab.com/gitlab-com/runbooks) 
+  [Rubix - A Python library for building runbooks in Jupyter Notebooks](https://github.com/Nurtch/rubix) 
+  [Using Document Builder to create a custom runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-document-builder.html) 

 **Related services:** 
+  [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 

# OPS07-BP04 Use playbooks to investigate issues
<a name="ops_ready_to_support_use_playbooks"></a>

 *Playbooks* are step-by-step guides used to investigate an incident. When incidents happen, playbooks are used to investigate, scope impact, and identify a root cause. Playbooks are used for a variety of scenarios, from failed deployments to security incidents. In many cases, playbooks identify the root cause that a runbook is used to mitigate. Playbooks are an essential component of your organization's incident response plans. 

 A good playbook has several key features. It guides the user, step by step, through the process of discovery. Thinking outside-in, what steps should someone follow to diagnose an incident? Clearly define in the playbook if special tools or elevated permissions are needed in the playbook. Having a communication plan to update stakeholders on the status of the investigation is a key component. In situations where a root cause can't be identified, the playbook should have an escalation plan. If the root cause is identified, the playbook should point to a runbook that describes how to resolve it. Playbooks should be stored centrally and regularly maintained. If playbooks are used for specific alerts, provide your team with pointers to the playbook within the alert. 

 As your organization matures, automate your playbooks. Start with playbooks that cover low-risk incidents. Use scripting to automate the discovery steps. Make sure that you have companion runbooks to mitigate common root causes. 

 **Desired outcome:** Your organization has playbooks for common incidents. The playbooks are stored in a central location and available to your team members. Playbooks are updated frequently. For any known root causes, companion runbooks are built. 

 **Common anti-patterns:** 
+  There is no standard way to investigate an incident. 
+  Team members rely on muscle memory or institutional knowledge to troubleshoot a failed deployment. 
+  New team members learn how to investigate issues through trial and error. 
+  Best practices for investigating issues are not shared across teams. 

 **Benefits of establishing this best practice:** 
+  Playbooks boost your efforts to mitigate incidents. 
+  Different team members can use the same playbook to identify a root cause in a consistent manner. 
+  Known root causes can have runbooks developed for them, speeding up recovery time. 
+  Playbooks help team members to start contributing sooner. 
+  Teams can scale their processes with repeatable playbooks. 

 **Level of risk exposed if this best practice is not established:** Medium 

## Implementation guidance
<a name="implementation-guidance"></a>

 How you build and use playbooks depends on the maturity of your organization. If you are new to the cloud, build playbooks in text form in a central document repository. As your organization matures, playbooks can become semi-automated with scripting languages like Python. These scripts can be run inside a Jupyter notebook to speed up discovery. Advanced organizations have fully automated playbooks for common issues that are auto-remediated with runbooks. 

 Start building your playbooks by listing common incidents that happen to your workload. Choose playbooks for incidents that are low risk and where the root cause has been narrowed down to a few issues to start. After you have playbooks for simpler scenarios, move on to the higher risk scenarios or scenarios where the root cause is not well known. 

 Your text playbooks should be automated as your organization matures. Using services like [AWS Systems Manager Automations](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html), flat text can be transformed into automations. These automations can be run against your workload to speed up investigations. These automations can be activated in response to events, reducing the mean time to discover and resolve incidents. 

 Customers can use [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) to respond to incidents. This service provides a single interface to triage incidents, inform stakeholders during discovery and mitigation, and collaborate throughout the incident. It uses AWS Systems Manager Automations to speed up detection and recovery. 

 **Customer example** 

 A production incident impacted AnyCompany Retail. The on-call engineer used a playbook to investigate the issue. As they progressed through the steps, they kept the key stakeholders, identified in the playbook, up to date. The engineer identified the root cause as a race condition in a backend service. Using a runbook, the engineer relaunched the service, bringing AnyCompany Retail back online. 

### Implementation steps
<a name="implementation-steps"></a>

 If you don't have an existing document repository, we suggest creating a version control repository for your playbook library. You can build your playbooks using Markdown, which is compatible with most playbook automation systems. If you are starting from scratch, use the following example playbook template. 

```
# Playbook Title
## Playbook Info
| Playbook ID | Description | Tools Used | Special Permissions | Playbook Author | Last Updated | Escalation POC | Stakeholders | Communication Plan |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| RUN001 | What is this playbook for? What incident is it used for? | Tools | Permissions | Your Name | 2022-09-21 | Escalation Name | Stakeholder Name | How will updates be communicated during the investigation? |
## Steps
1. Step one
2. Step two
```

1.  If you don't have an existing document repository or wiki, create a new version control repository for your playbooks in your version control system. 

1.  Identify a common issue that requires investigation. This should be a scenario where the root cause is limited to a few issues and resolution is low risk. 

1.  Using the Markdown template, fill in the Playbook Name section and the fields under Playbook Info. 

1.  Fill in the troubleshooting steps. Be as clear as possible on what actions to perform or what areas you should investigate. 

1.  Give a team member the playbook and have them go through it to validate it. If there's anything missing or something isn't clear, update the playbook. 

1.  Publish your playbook in your document repository and inform your team and any stakeholders. 

1.  This playbook library will grow as you add more playbooks. Once you have several playbooks, start automating them using tools like AWS Systems Manager Automations to keep automation and playbooks in sync. 

 **Level of effort for the implementation plan:** Low. Your playbooks should be text documents stored in a central location. More mature organizations will move towards automating playbooks. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_ops_model_def_proc_owners.html) 
+  [OPS07-BP03 Use runbooks to perform procedures](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_ready_to_support_use_runbooks.html) 
+  [OPS10-BP01 Use a process for event, incident, and problem management](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_event_response_event_incident_problem_process.html) 
+  [OPS10-BP02 Have a process per alert](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_event_response_process_per_alert.html) 
+  [OPS11-BP04 Perform knowledge management](https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/ops_evolve_ops_knowledge_management.html) 

 **Related documents:** 
+  [Achieving Operational Excellence using automated playbook and runbook](https://aws.amazon.com/blogs/mt/achieving-operational-excellence-using-automated-playbook-and-runbook/) 
+  [AWS Systems Manager: Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html) 
+  [Use AWS Systems Manager Automation runbooks to resolve operational tasks](https://aws.amazon.com/blogs/mt/use-aws-systems-manager-automation-runbooks-to-resolve-operational-tasks/) 

 **Related videos:** 
+  [AWS re:Invent 2019: DIY guide to runbooks, incident reports, and incident response (SEC318-R1)](https://www.youtube.com/watch?v=E1NaYN_fJUo) 
+  [AWS Systems Manager Incident Manager - AWS Virtual Workshops](https://www.youtube.com/watch?v=KNOc0DxuBSY) 
+  [Integrate Scripts into AWS Systems Manager](https://www.youtube.com/watch?v=Seh1RbnF-uE) 

 **Related examples:** 
+  [AWS Customer Playbook Framework](https://github.com/aws-samples/aws-customer-playbook-framework) 
+  [AWS Systems Manager: Automation walkthroughs](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk.html) 
+  [Building an AWS incident response runbook using Jupyter notebooks and CloudTrail Lake](https://catalog.workshops.aws/workshops/a5801f0c-7bd6-4282-91ae-4dfeb926a035/en-US) 
+  [Rubix – A Python library for building runbooks in Jupyter Notebooks](https://github.com/Nurtch/rubix) 
+  [Using Document Builder to create a custom runbook](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-walk-document-builder.html) 

 **Related services:** 
+  [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) 
+  [AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/what-is-incident-manager.html) 

# OPS07-BP05 Make informed decisions to deploy systems and changes
<a name="ops_ready_to_support_informed_deploy_decisions"></a>

Have processes in place for successful and unsuccessful changes to your workload. A pre-mortem is an exercise where a team simulates a failure to develop mitigation strategies. Use pre-mortems to anticipate failure and create procedures where appropriate. Evaluate the benefits and risks of deploying changes to your workload. Verify that all changes comply with governance. 

 **Desired outcome:** 
+  You make informed decisions when deploying changes to your workload. 
+  Changes comply with governance. 

 **Common anti-patterns:** 
+ Deploying a change to our workload without a process to handle a failed deployment.
+ Making changes to your production environment that are out of compliance with governance requirements.
+ Deploying a new version of your workload without establishing a baseline for resource utilization.

 **Benefits of establishing this best practice:** 
+  You are prepared for unsuccessful changes to your workload. 
+  Changes to your workload are compliant with governance policies. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Use pre-mortems to develop processes for unsuccessful changes. Document your processes for unsuccessful changes. Ensure that all changes comply with governance. Evaluate the benefits and risks to deploying changes to your workload. 

 **Customer example** 

 AnyCompany Retail regularly conducts pre-mortems to validate their processes for unsuccessful changes. They document their processes in a shared Wiki and update it frequently. All changes comply with governance requirements. 

 **Implementation steps** 

1.  Make informed decisions when deploying changes to your workload. Establish and review criteria for a successful deployment. Develop scenarios or criteria that would initiate a rollback of a change. Weigh the benefits of deploying changes against the risks of an unsuccessful change. 

1.  Verify that all changes comply with governance policies. 

1.  Use pre-mortems to plan for unsuccessful changes and document mitigation strategies. Run a table-top exercise to model an unsuccessful change and validate roll-back procedures. 

 **Level of effort for the implementation plan:** Moderate. Implementing a practice of pre-mortems requires coordination and effort from stakeholders across your organization 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS01-BP03 Evaluate governance requirements](ops_priorities_governance_reqs.md) - Governance requirements are a key factor in determining whether to deploy a change. 
+  [OPS06-BP01 Plan for unsuccessful changes](ops_mit_deploy_risks_plan_for_unsucessful_changes.md) - Establish plans to mitigate a failed deployment and use pre-mortems to validate them. 
+  [OPS06-BP02 Test deployments](ops_mit_deploy_risks_test_val_chg.md) - Every software change should be properly tested before deployment in order to reduce defects in production. 
+  [OPS07-BP01 Ensure personnel capability](ops_ready_to_support_personnel_capability.md) - Having enough trained personnel to support the workload is essential to making an informed decision to deploy a system change. 

 **Related documents:** 
+ [ Amazon Web Services: Risk and Compliance ](https://docs.aws.amazon.com/whitepapers/latest/aws-risk-and-compliance/welcome.html)
+ [AWS Shared Responsibility Model ](https://aws.amazon.com/compliance/shared-responsibility-model/)
+ [ Governance in the AWS Cloud: The Right Balance Between Agility and Safety ](https://aws.amazon.com/blogs/apn/governance-in-the-aws-cloud-the-right-balance-between-agility-and-safety/)

# OPS07-BP06 Create support plans for production workloads
<a name="ops_ready_to_support_enable_support_plans"></a>

 Enable support for any software and services that your production workload relies on. Select an appropriate support level to meet your production service-level needs. Support plans for these dependencies are necessary in case there is a service disruption or software issue. Document support plans and how to request support for all service and software vendors. Implement mechanisms that verify that support points of contacts are kept up to date. 

 **Desired outcome:** 
+  Implement support plans for software and services that production workloads rely on. 
+  Choose an appropriate support plan based on service-level needs. 
+  Document the support plans, support levels, and how to request support. 

 **Common anti-patterns:** 
+  You have no support plan for a critical software vendor. Your workload is impacted by them and you can do nothing to expedite a fix or get timely updates from the vendor. 
+  A developer that was the primary point of contact for a software vendor left the company. You are not able to reach the vendor support directly. You must spend time researching and navigating generic contact systems, increasing the time required to respond when needed. 
+  A production outage occurs with a software vendor. There is no documentation on how to file a support case. 

 **Benefits of establishing this best practice:** 
+  With the appropriate support level, you are able to get a response in the time frame necessary to meet service-level needs. 
+  As a supported customer you can escalate if there are production issues. 
+  Software and services vendors can assist in troubleshooting during an incident. 

 **Level of risk exposed if this best practice is not established:** Low 

## Implementation guidance
<a name="implementation-guidance"></a>

 Enable support plans for any software and services vendors that your production workload relies on. Set up appropriate support plans to meet service-level needs. For AWS customers, this means activating AWS Business Support or greater on any accounts where you have production workloads. Meet with support vendors on a regular cadence to get updates about support offerings, processes, and contacts. Document how to request support from software and services vendors, including how to escalate if there is an outage. Implement mechanisms to keep support contacts up to date. 

 **Customer example** 

 At AnyCompany Retail, all commercial software and services dependencies have support plans. For example, they have AWS Enterprise Support activated on all accounts with production workloads. Any developer can raise a support case when there is an issue. There is a wiki page with information on how to request support, whom to notify, and best practices for expediting a case. 

 **Implementation steps** 

1.  Work with stakeholders in your organization to identify software and services vendors that your workload relies on. Document these dependencies. 

1.  Determine service-level needs for your workload. Select a support plan that aligns with them. 

1.  For commercial software and services, establish a support plan with the vendors. 

   1.  Subscribing to AWS Business Support or greater for all production accounts provides faster response time from AWS Support and strongly recommended. If you don’t have premium support, you must have an action plan to handle issues, which require help from AWS Support. AWS Support provides a mix of tools and technology, people, and programs designed to proactively help you optimize performance, lower costs, and innovate faster. In addition, AWS Business Support provides additional benefits, including API access to AWS Trusted Advisor and AWS Health for programmatic integration with your systems, alongside other access methods like the AWS Management Console and Amazon EventBridge channels. 

1.  Document the support plan in your knowledge management tool. Include how to request support, who to notify if a support case is filed, and how to escalate during an incident. A wiki is a good mechanism to allow anyone to make necessary updates to documentation when they become aware of changes to support processes or contacts. 

 **Level of effort for the implementation plan:** Low. Most software and services vendors offer opt-in support plans. Documenting and sharing support best practices on your knowledge management system verifies that your team knows what to do when there is a production issue. 

## Resources
<a name="resources"></a>

 **Related best practices:** 
+  [OPS02-BP02 Processes and procedures have identified owners](ops_ops_model_def_proc_owners.md) 

 **Related documents:** 
+ [AWS Support Plans ](https://docs.aws.amazon.com/awssupport/latest/user/aws-support-plans.html)

 **Related services:** 
+ [AWS Business Support ](https://aws.amazon.com/premiumsupport/plans/business/)
+ [AWS Enterprise Support ](https://aws.amazon.com/premiumsupport/plans/enterprise/)