

# 11 – Detect and react to failures
<a name="design-principle-11"></a>

 **How do you detect and react to failures impacting your SAP workload?** Design how software or operating procedures can help ensure the health and resilience of your SAP workload. Monitor for potential and actual failures, focusing on prevention where possible. Consider whether a component is distributed or is a single point of failure and design a resiliency solution that minimizes the impact to your workload. In addition to testing periodically to understand your risk profile, examine how automation could improve your resilience. 

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/latest/sap-lens/design-principle-11.html)

 For more details, refer to the following: 
+  AWS Documentation: [Architecture Guidance for Availability and Reliability of SAP on AWS including Failure Scenarios and Architecture Patterns](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html)

# Best Practice 11.1 – Monitor failures of the SAP application, AWS resources, and connectivity
<a name="best-practice-11-1"></a>

Monitoring for failures of the SAP application, AWS resources, and connectivity helps you to react to failures or potential failures in a timely manner.

 **Suggestion 11.1.1 – Use AWS Health Dashboard and notifications** 

 The [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/) gives you a personalized view of the status of the AWS services that power your applications, so that you can quickly see when an issue is impacting your SAP workload, for example the loss of an [Amazon Elastic Block Store (Amazon EBS)](https://aws.amazon.com/ebs/) volume attached to one of your [Amazon EC2](https://aws.amazon.com/ec2/) instances. 

 The dashboard also provides forward-looking notifications, and you can set up alerts across multiple channels, including email, so that you receive timely and relevant information to help plan for scheduled changes. For example, when AWS hardware maintenance is scheduled for one of your [Amazon EC2](https://aws.amazon.com/ec2/) instances, you receive a notification with information to help you plan for and proactively address any issues associated with the upcoming change. 

 **Suggestion 11.1.2 – Evaluate AWS services to understand the health of your SAP system** 

 AWS provides a number of [management and governance](https://aws.amazon.com/products/management-and-governance/) services that you should evaluate, including Amazon CloudWatch and Amazon CloudWatch Application Insights for SAP. Focus on the metrics that indicate a failure or potential failure, such as EC2 instance failure, high CPU utilization, and file system utilization. 
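 As a minimal sketch of alarming on one such metric, the following builds the parameters for a CloudWatch alarm on high CPU utilization of a single EC2 instance. The function name, alarm name, threshold, and the example instance ID and SNS topic ARN are illustrative assumptions, not values from this guidance. File system utilization is not covered here because it is not a default EC2 metric and typically needs an agent on the instance. 

```python
from __future__ import annotations

from typing import Any


def build_cpu_alarm(
    instance_id: str,
    sns_topic_arn: str,
    threshold_percent: float = 90.0,
) -> dict[str, Any]:
    """Return keyword arguments for the CloudWatch PutMetricAlarm API.

    The alarm name and the default threshold are hypothetical; tune them
    to the sizing and behavior of your own SAP system.
    """
    return {
        "AlarmName": f"sap-{instance_id}-cpu-high",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluation period
        "EvaluationPeriods": 3,   # sustained for three periods before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify operators through Amazon SNS
    }


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    alarm = build_cpu_alarm(
        "i-0123456789abcdef0",
        "arn:aws:sns:eu-west-1:111122223333:sap-alerts",
    )
    print(alarm["AlarmName"], alarm["Threshold"])
```

 With boto3 installed and credentials configured, the returned dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. 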

 Refer to the Operational Excellence pillar for more details: 
+  SAP Lens [Operational Excellence]: [Best Practice 1.1 - Implement prerequisites for monitoring SAP on AWS](best-practice-1-1.md) 
+  SAP Lens [Operational Excellence]: [Best Practice 1.4 - Implement workload configuration monitoring](best-practice-1-4.md) 

 **Suggestion 11.1.3 – Evaluate the capability of SAP tools to monitor failures** 

 Tools from SAP, such as Solution Manager and Landscape Manager, allow you to view any monitoring data in the context of the application. The following monitoring solutions are available from SAP. Review any additional licensing costs as part of the evaluation of these tools. 
+  SAP Documentation: [SAP Focused run](https://support.sap.com/en/alm/sap-focused-run.html) 
+  SAP Documentation: [SAP Solution Manager](https://support.sap.com/en/alm/solution-manager.html) 
+  SAP Documentation: [SAP Landscape Manager (LaMa)](https://help.sap.com/viewer/lama_help) 
+  SAP Note: [2574820 - SAP Landscape Management Cloud Manager for Amazon Web Services (AWS)](https://launchpad.support.sap.com/#/notes/2574820) [Requires SAP Portal Access] 

 **Suggestion 11.1.4 – Evaluate third-party tools for AWS and SAP monitoring** 

 The following monitoring solutions are available from the AWS Marketplace. You should evaluate these and other third-party tools. 
+  AWS Documentation: [Monitoring Solutions in AWS Marketplace](https://aws.amazon.com/marketplace/b/2649280011?ref_=mp_nav_category_2649280011) 

# Best Practice 11.2 – Define an approach to maintain availability
<a name="best-practice-11-2"></a>

Maintain availability by having a resilient architecture that can sustain the failure of a single technical component or AWS service. Implement mechanisms, which could include redundant capacity, load balancing, and software clusters.

 **Suggestion 11.2.1 – Avoid failures due to exhausted resources or service deterioration** 

Investigate over-provisioning of resources, proactive monitoring of growth, and throttling usage by setting limits.

 The Operational Excellence pillar covers the different ways in which you can understand the state of your SAP application and ensure that the appropriate actions are taken. See [Operational Excellence]: [1 - Design SAP workload to allow understanding and reaction to its state](design-principle-1.md). 

 The Performance Efficiency pillar provides guidance on right-sizing and scaling capacity. See [Performance]: [16 - Understand ongoing performance and optimization options](design-principle-16.md). 

 **Suggestion 11.2.2 – Have a strategy for scheduled maintenance** 

 If your business has a requirement to minimize scheduled outages, you should develop a strategy for maintenance at all levels – SAP application, database, operating system, and AWS. Consider the following: 
+ Use of replication and cluster solutions to alternate the primary and secondary node.
+ Excess capacity and mechanisms to scale up and down to facilitate rolling outages.
+  Use of a live patching approach for the operating system, if possible. 
  +  [SUSE Linux Enterprise Live Patching](https://www.suse.com/products/live-patching/) 
  +  [Red Hat Reducing downtime for SAP HANA Whitepaper](https://www.redhat.com/cms/managed-files/pa-sap-hana-reducing-downtime-overview-f22788pr-202004-en.pdf) 
+  AWS Documentation: [AWS Systems Manager Patch Manager Patch Groups](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-patch.html) 
+  SAP Note: [1913302 - HANA: Suspend DB connections for short maintenance tasks](https://launchpad.support.sap.com/#/notes/1913302) [Requires SAP Portal Access] 
+  SAP Note: [2077934 - Rolling kernel switch in HA environments](https://launchpad.support.sap.com/#/notes/2077934) [Requires SAP Portal Access] 
+  SAP Note: [953653 - Rolling Kernel Switch](https://launchpad.support.sap.com/#/notes/953653) [Requires SAP Portal Access] 
+  SAP Note: [2254173 - Linux: Rolling Kernel Switch in Pacemaker-based NetWeaver HA environments](https://launchpad.support.sap.com/#/notes/2254173) [Requires SAP Portal Access] 

You should also evaluate the elastic capabilities of AWS services to reduce the overall downtime of scheduled maintenance by temporarily increasing performance. For example, you could scale up the Amazon EC2 instance running your database to provide more CPU and storage throughput for upgrade activities, or switch your EBS volume type from `gp2` to `io2` to improve storage throughput during a database reorganization.
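The volume type switch can be sketched as a pair of requests, one applied before the maintenance window and one after it. This is only an illustration: the function names, the volume ID, and the IOPS figure are assumptions, and the dictionaries are shaped as keyword arguments for the EC2 `ModifyVolume` API.

```python
from __future__ import annotations

from typing import Any


def build_volume_change(
    volume_id: str,
    volume_type: str,
    iops: int | None = None,
) -> dict[str, Any]:
    """Return keyword arguments for the EC2 ModifyVolume API."""
    kwargs: dict[str, Any] = {"VolumeId": volume_id, "VolumeType": volume_type}
    if iops is not None:
        kwargs["Iops"] = iops  # provisioned IOPS apply to io1/io2 volumes
    return kwargs


def maintenance_plan(
    volume_id: str,
    normal_type: str = "gp2",
    boosted_type: str = "io2",
    boosted_iops: int = 16000,  # hypothetical value; size it for your workload
) -> dict[str, dict[str, Any]]:
    """Pair the change made before maintenance with the one that reverts it."""
    return {
        "before": build_volume_change(volume_id, boosted_type, boosted_iops),
        "after": build_volume_change(volume_id, normal_type),
    }


if __name__ == "__main__":
    plan = maintenance_plan("vol-0123456789abcdef0")  # hypothetical volume ID
    for step, request in plan.items():
        print(step, request)
```

Each request could be submitted with `boto3.client("ec2").modify_volume(**request)`. Amazon EBS limits how frequently the same volume can be modified, so check the current documentation when you plan the revert step.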

 **Suggestion 11.2.3 – Protect SAP single points of failure with software clusters or other mechanisms** 

You can use a high availability (HA) clustering solution for autonomous failover of SAP single points of failure (SAP Central Services and database) across Availability Zones.

 There are multiple SAP-certified clustering solutions [listed on the SAP website](https://wiki.scn.sap.com/wiki/display/SI/Certified+HA-Interface+Partners). SAP clustering solutions are supported by the cluster software vendors themselves, not by SAP. SAP only certifies the solution. Any custom-built solution is not certified and will need to be supported by the solution builder. 

If you choose not to use a clustering solution for your single points of failure, consider scripting or runbooks to minimize the errors associated with restoring services.

 **Suggestion 11.2.4 – Consider redundant capacity or automatic scaling for components that support it** 

Evaluate static, dynamic, or scheduled capacity changes to match your usage. Examine the minimum capacity requirements and how they would be impacted by failures and maintenance. Overprovision where appropriate to allow time to recover from failure.

If you need to maintain 100% capacity in the event of an AZ failure, then you should consider deploying the application tier across three AZs, each with 50% of the total required capacity.
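The arithmetic behind this figure can be sketched as follows: the Availability Zones that survive a failure must together carry the full load, so each one needs the required capacity divided by the number of surviving zones. The function name and parameters are illustrative assumptions.

```python
def per_az_capacity(
    required_capacity: float,
    az_count: int,
    az_failures_tolerated: int = 1,
) -> float:
    """Capacity to deploy in each AZ so the survivors still meet the requirement."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one Availability Zone must survive")
    return required_capacity / surviving


if __name__ == "__main__":
    for zones in (2, 3):
        each = per_az_capacity(100.0, zones)
        print(f"{zones} AZs: {each:.0f}% per AZ, {each * zones:.0f}% provisioned in total")
    # 2 AZs: 100% per AZ, 200% provisioned in total
    # 3 AZs: 50% per AZ, 150% provisioned in total
```

Under these assumptions, spreading the application tier across three AZs tolerates the loss of one AZ with 150% of the required capacity provisioned in total, compared with 200% for two AZs.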

 In addition to deploying the SAP Application Server Layer across multiple AZs, you could consider scaling solutions such as the one described in the following SAP on AWS Blog post that leverages the capabilities of [Amazon EC2 Auto Scaling](https://aws.amazon.com/ec2/autoscaling). 
+  SAP on AWS Blog: [Using AWS to enable SAP Application Auto Scaling](https://aws.amazon.com/blogs/awsforsap/using-aws-to-enable-sap-application-auto-scaling/) 
+  AWS Documentation: [Amazon EC2 Instance Types for SAP](https://aws.amazon.com/sap/instance-types/) 
+  SAP Note: [1656099 - SAP Applications on AWS: Supported DB/OS and Amazon EC2 products](https://launchpad.support.sap.com/#/notes/1656099) [Requires SAP Portal Access] 

 **Suggestion 11.2.5 – Ensure the availability of capacity for all identified failure scenarios** 

 The following are examples of failure scenarios that could be used to guide your analysis. Granularity and coverage of the scenarios, classification, and impact will vary depending on your requirements and architecture. 

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/latest/sap-lens/best-practice-11-2.html)

Further guidance on capacity reservations is available in [Reliability]: [Suggestion 10.2.5 - Investigate strategies for ensuring capacity](best-practice-10-2.md) and in the AWS whitepaper: [Architecture Guidance for Availability and Reliability of SAP on AWS](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html). 

You can review which Reserved Instances are available within your AWS account using the [Reserved Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ri-market-concepts-buying.html#view-reserved-instances) section of the Amazon EC2 console. You can review which On-Demand Capacity Reservations are available using the [Capacity Reservations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-reservations-using.html#capacity-reservations-view) section of the Amazon EC2 console.

 **Suggestion 11.2.6 – Use AWS services that have inherent availability where applicable** 

 Several AWS services have inherent availability as part of their design and run across multiple Availability Zones for high availability. Some of the relevant services used in an SAP context include: 
+  AWS Service: [Amazon EFS](https://docs.aws.amazon.com/efs/latest/ug/how-it-works.html) 
+  AWS Service: [Elastic Load Balancing](https://docs.aws.amazon.com/elasticloadbalancing/latest/userguide/how-elastic-load-balancing-works.html) 
+  AWS Service: [Route 53](https://aws.amazon.com/route53/faqs/) 
+  AWS Service: [AWS Transit Gateway](https://docs.aws.amazon.com/vpc/latest/tgw/how-transit-gateways-work.html) 
+  AWS Service: [Amazon S3](https://aws.amazon.com/s3/) 
+  AWS Service: [Amazon FSx](https://docs.aws.amazon.com/fsx/index.html) 

In addition, stateless components, such as bastion hosts or SAProuter, can use Auto Scaling groups to achieve high availability.

 **Suggestion 11.2.7 – Follow AWS best practices to ensure network connectivity** 

 Evaluate one or more of the following AWS best practices to ensure the resilience of network connectivity to the AWS Region in use: 
+  AWS Documentation: [AWS Direct Connect Resiliency Toolkit](https://docs.aws.amazon.com/directconnect/latest/UserGuide/resilency_toolkit.html) 
+  AWS Documentation: [AWS VPN CloudHub](https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/aws-vpn-cloudhub.html) 
+ AWS Documentation: [AWS Cloud WAN](https://aws.amazon.com/cloud-wan/)

 If your cluster solution relies on an overlay IP, consider the following to enable access from outside of the VPC: 
+  AWS Documentation: [SAP on AWS High Availability with Overlay IP Address Routing](https://docs.aws.amazon.com/sap/latest/sap-hana/sap-ha-overlay-ip.html) 

# Best Practice 11.3 – Define an approach to restore service availability
<a name="best-practice-11-3"></a>

Restoring availability assumes that for a particular failure scenario, some loss of service will occur. The restore approach should examine the amount of time needed to restore service, and the actions required to meet the availability goal.

 **Suggestion 11.3.1 – Enable instance recovery for EC2 instances** 

 AWS provides two modes of instance recovery: simplified (on by default) and Amazon CloudWatch action-based (configurable). Both modes monitor an Amazon EC2 instance and automatically recover the instance if it becomes impaired due to an underlying hardware failure. This feature can remove the need for manual intervention, but startup, application restart, and load times should be factored into the recovery time objective (RTO).

CloudWatch action-based recovery is customizable. For standalone instances, you can tune the alarm settings, such as the number of evaluation periods, to control how quickly recovery is initiated.

If you intend to use a clustering solution to protect against hardware failure, you should evaluate if instance recovery is compatible with the cluster solution. 
+  AWS Documentation: [Amazon EC2 Instance Recovery](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html) 
+  SAP on AWS Documentation: [Technical requirements for high availability clusters](https://docs.aws.amazon.com/sap/latest/sap-netweaver/technical-requirements.html) 
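
 As a minimal sketch of CloudWatch action-based recovery, the following builds an alarm on the `StatusCheckFailed_System` metric whose action is the EC2 recover action. The function names, alarm name, and default periods are illustrative assumptions. The helper shows how the period and evaluation count determine roughly how long an impairment must persist before recovery is initiated. 

```python
from __future__ import annotations

from typing import Any


def build_recovery_alarm(
    instance_id: str,
    region: str,
    period_seconds: int = 60,
    evaluation_periods: int = 2,
) -> dict[str, Any]:
    """Return PutMetricAlarm keyword arguments that recover an impaired instance."""
    return {
        "AlarmName": f"sap-{instance_id}-auto-recover",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # EC2 recover action; the Region must match the instance's Region.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def detection_seconds(alarm: dict[str, Any]) -> int:
    """Approximate time the check must keep failing before the alarm acts."""
    return alarm["Period"] * alarm["EvaluationPeriods"]


if __name__ == "__main__":
    alarm = build_recovery_alarm("i-0123456789abcdef0", "eu-west-1")
    print(alarm["AlarmActions"][0], detection_seconds(alarm))
```

 The detection time is only one part of the RTO. Instance startup, application restart, and load times still need to be added. 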

 **Suggestion 11.3.2 – Have a strategy to rebuild EC2 instances using AMIs and infrastructure as code** 

 The benefit of infrastructure as code (IaC) is the ability to build and tear down entire environments programmatically. If architected for resiliency, an environment can be implemented in minutes using [CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) templates or [AWS Systems Manager automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html). Automation is critical for maintaining high availability and fast recovery. 

 You should evaluate the following AWS services as part of your strategy: 
+  AWS Service: [EC2 Image Builder](https://aws.amazon.com/image-builder/) 
+  AWS Service: [AWS Launch Wizard for SAP](https://docs.aws.amazon.com/launchwizard/latest/userguide/launch-wizard-sap.html) 
+  AWS Service: [AWS Cloud Development Kit (AWS CDK)](https://aws.amazon.com/cdk/) 
+  SAP on AWS Blog: [DevOps for SAP](https://aws.amazon.com/blogs/awsforsap/category/devops/) 

 **Suggestion 11.3.3 – Understand Amazon EBS failures** 

 Failure of one or more EBS volumes could impact the availability and durability of your SAP workload. Therefore, you should understand the Amazon EBS failure rates, notification mechanisms, and recovery options. 
+  AWS Documentation: [Amazon EBS Durability](https://aws.amazon.com/ebs/features/#Amazon_EBS_availability_and_durability) 
+  AWS Documentation: [Monitor the status of your volumes](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html) 
+  AWS Service: [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/) 
+  AWS Documentation: [Volume recovery using Amazon EBS Snapshots](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html) 

 **Suggestion 11.3.4 – Have a strategy for reacting to AWS Health Dashboard notifications** 

 You should have a strategy for receiving and actioning notifications from your AWS Health Dashboard. This could include using Amazon EventBridge (formerly Amazon CloudWatch Events) rules to publish notifications to Amazon SNS, or integrating with your ITSM tools via the [AWS Health API](https://docs.aws.amazon.com/health/latest/ug/health-api.html). 
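
 AWS Health publishes events to Amazon EventBridge, where a rule can match them and forward them to a target such as an Amazon SNS topic. The following sketch builds such an event pattern and reduces a matching event to the fields an ITSM ticket might need. The function names, the ticket layout, and the sample event are illustrative assumptions. 

```python
from __future__ import annotations

import json
from typing import Any, Iterable


def build_health_event_pattern(
    services: Iterable[str],
    categories: Iterable[str],
) -> dict[str, Any]:
    """Return an EventBridge event pattern matching AWS Health events."""
    return {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {
            "service": sorted(services),
            "eventTypeCategory": sorted(categories),
        },
    }


def to_ticket(event: dict[str, Any]) -> dict[str, Any]:
    """Reduce a Health event to a hypothetical ITSM ticket payload."""
    detail = event.get("detail", {})
    return {
        "title": detail.get("eventTypeCode", "UNKNOWN"),
        "category": detail.get("eventTypeCategory", "unknown"),
        "resources": [
            entity.get("entityValue")
            for entity in detail.get("affectedEntities", [])
        ],
    }


if __name__ == "__main__":
    pattern = build_health_event_pattern(["EC2", "EBS"], ["issue", "scheduledChange"])
    print(json.dumps(pattern, indent=2))

    # Hand-written sample event for illustration; not captured from a real account.
    sample = {
        "source": "aws.health",
        "detail-type": "AWS Health Event",
        "detail": {
            "service": "EC2",
            "eventTypeCode": "AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED",
            "eventTypeCategory": "scheduledChange",
            "affectedEntities": [{"entityValue": "i-0123456789abcdef0"}],
        },
    }
    print(to_ticket(sample))
```

 The pattern could be supplied as the `EventPattern` of an EventBridge rule, for example with `boto3.client("events").put_rule(Name=..., EventPattern=json.dumps(pattern))`. 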

 **Suggestion 11.3.5 – Ensure that you are protected against accidental or malicious events impacting availability** 

You should consider the following approaches for ensuring that you are protected against accidental or malicious events that could impact the availability of your SAP workload.
+  Implement a [principle of least privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege) and enforce separation of duties within AWS Identity and Access Management. 
+  Follow the guidance in AWS Knowledge Center article: [How do I protect my data against accidental EC2 instance termination?](https://aws.amazon.com/premiumsupport/knowledge-center/accidental-termination/) 
+  Follow the [Best practices for Amazon EC2.](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-best-practices.html) 
+  You should also follow the security guidance in [Security]: [Best Practice 8.3 - Secure your data recovery mechanisms to protect against threats.](best-practice-8-3.md) 

 **Suggestion 11.3.6 – Identify dependencies beyond the SAP workload in AWS** 

Understand the underlying dependencies for your SAP business processes, including shared services and supporting components or systems. Some examples include Active Directory, DNS, identity providers, SaaS services, and on-premises systems. Assess the impact of failure and the required mitigations.

# Best Practice 11.4 – Conduct periodic tests of resilience
<a name="best-practice-11-4"></a>

Periodically test resilience against critical failure scenarios to prove that software and procedures result in a predictable outcome. Evaluate any changes to architecture, software, or support personnel to determine if additional testing is necessary.

 **Suggestion 11.4.1 – Define the in-scope critical failure scenarios based on your business requirements** 

 You should define which critical failure scenarios you are able to test, aligned with your business requirements. The following are examples of failure scenarios which could be used to guide your analysis. Granularity and coverage of the scenarios, classification and impact will vary depending on your requirements and architecture. 

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/latest/sap-lens/best-practice-11-4.html)

 **Suggestion 11.4.2 – Define a set of test cases to simulate critical failures** 

You should have a complete set of tests defined to simulate the critical failure scenarios that would impact your SAP workload.

You should be aware that for some failure scenarios a simulation might not fully represent the actual failure that would occur. For example, to simulate a hardware issue, you cannot cause the hardware underlying an EC2 instance to fail, but for Nitro-based instances you can generate a kernel panic to cause the instance to reboot.

 In addition, [AWS Fault Injection Service (AWS FIS)](https://aws.amazon.com/fis/) is designed to help simulate failures within your AWS resources. 
+  AWS Documentation: [High Availability Configuration Guide for SAP on HANA](https://docs.aws.amazon.com/sap/latest/sap-hana/sap-hana-on-aws-ha-configuration.html) 
+  AWS Documentation: [Send a diagnostic interrupt](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/diagnostic-interrupt.html#diagnostic-interrupt-prereqs) 
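
 A diagnostic interrupt can be sent through the EC2 API. The following is a minimal sketch, assuming a boto3 EC2 client is passed in. The function name and the `RecordingClient` stand-in are hypothetical, and the stand-in exists only so the example runs without AWS access. Whether the interrupt results in a kernel panic depends on how the operating system is configured, as described in the linked prerequisites. 

```python
from __future__ import annotations

from typing import Any


def simulate_kernel_panic(ec2_client: Any, instance_id: str, dry_run: bool = True) -> Any:
    """Send a diagnostic interrupt to a Nitro-based instance.

    dry_run defaults to True so that a real request must be opted into;
    boto3 reports the result of a dry run by raising a ClientError.
    """
    return ec2_client.send_diagnostic_interrupt(
        InstanceId=instance_id,
        DryRun=dry_run,
    )


class RecordingClient:
    """Stand-in for boto3.client("ec2") that only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def send_diagnostic_interrupt(self, **kwargs: Any) -> dict[str, Any]:
        self.calls.append(kwargs)
        return {}


if __name__ == "__main__":
    client = RecordingClient()
    simulate_kernel_panic(client, "i-0123456789abcdef0")  # hypothetical instance ID
    print(client.calls)
```

 In a real test, `ec2_client` would be `boto3.client("ec2")` and the target would be a cluster node in a non-production or explicitly approved environment. 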

 **Suggestion 11.4.3 – Define the expected behavior for each test case** 

You should have a documented set of expected outcomes to baseline your testing.

 **Suggestion 11.4.4 – Define an approach for evaluating the impact of a change and the subsequent testing required** 

You should have an approach defined to evaluate the impact of a change on your environment and the testing required as part of that change to help ensure that it does not invalidate your approach to availability and reliability. Examples of these types of changes include software upgrades, patches, and parameter changes.

 **Suggestion 11.4.5 – Define a test schedule** 

Ensure that you have a test schedule that covers the initial implementation, testing of changes, and periodic validation of your environment.

 **Suggestion 11.4.6 – Review the testing outcomes** 

Based on the test outcomes, identify any improvements to the test cases, configuration or architecture.

 **Suggestion 11.4.7 – Define the required activities to return to a pre-test state** 

As part of each test, you should define the required activities to return to the pre-test state. This is to ensure that each test case is isolated from other tests and that the testing does not impact the availability and reliability of a production system.
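
One way to make the return to the pre-test state hard to skip is to attach the restore step to the test case itself and run it whether or not the test passes. The following sketch assumes hypothetical `inject`, `observe`, and `restore` callables and uses an in-memory dictionary in place of a real cluster.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ResilienceTest:
    name: str
    inject: Callable[[], None]   # simulate the failure
    observe: Callable[[], str]   # read the resulting system state
    expected: str                # documented expected outcome
    restore: Callable[[], None]  # return to the pre-test state


def run_test(test: ResilienceTest) -> dict[str, Any]:
    """Run one test case; the restore step runs even if a step raises."""
    try:
        test.inject()
        observed = test.observe()
        return {
            "name": test.name,
            "observed": observed,
            "passed": observed == test.expected,
        }
    finally:
        test.restore()


if __name__ == "__main__":
    # In-memory stand-in for a two-node cluster; node names are illustrative.
    cluster = {"primary": "node-a"}
    failover = ResilienceTest(
        name="primary node failure",
        inject=lambda: cluster.update(primary="node-b"),
        observe=lambda: cluster["primary"],
        expected="node-b",
        restore=lambda: cluster.update(primary="node-a"),
    )
    print(run_test(failover), cluster)
```

Because the restore step is part of the test definition, a test case cannot be registered without stating how its effects are undone.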

# Best Practice 11.5 – Automate reaction to failure
<a name="best-practice-11-5"></a>

You can minimize the impact to service by automating the response to failure. Design automation to respond to failure, impaired capacity, or loss of connectivity. Ensure clear arbitration criteria are defined to avoid false positives.

 **Suggestion 11.5.1 – Evaluate your automation for application awareness** 

For automation solutions that protect an application, evaluate the impact on state – for example, connected user sessions, logon targets, data replication consistency, and data corruption risk. 

 **Suggestion 11.5.2 – Evaluate the health check mechanisms that initiate automation** 

Health checks should be designed with controls to help ensure that automations are not started because of false positives.
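
One common control is to require several consecutive failed checks before automation is initiated, so that a single missed probe does not start a failover. The following is a minimal sketch of that idea. The class name and the default threshold are illustrative assumptions, and cluster products typically expose their own equivalent settings.

```python
class FailureDebouncer:
    """Signal a failure only after N consecutive unhealthy checks."""

    def __init__(self, failures_required: int = 3) -> None:
        if failures_required < 1:
            raise ValueError("failures_required must be at least 1")
        self.failures_required = failures_required
        self._consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one health check result; return True if automation should start."""
        if healthy:
            self._consecutive_failures = 0  # any success resets the count
            return False
        self._consecutive_failures += 1
        return self._consecutive_failures >= self.failures_required


if __name__ == "__main__":
    debouncer = FailureDebouncer(failures_required=3)
    # Two failures, a recovery, then three failures in a row.
    checks = [False, False, True, False, False, False]
    print([debouncer.record(result) for result in checks])
    # [False, False, False, False, False, True]
```

A higher threshold reduces false positives but lengthens detection time, so the value should be weighed against the recovery time objective.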

Where possible, rely on the data plane over the control plane for resilience. The control plane is used to configure resources, and the data plane delivers services. Data planes typically have higher availability design goals than control planes and are usually less complex.
+  AWS Documentation: [Static stability using Availability Zones](https://aws.amazon.com/builders-library/static-stability-using-availability-zones/) 