# 10 – Design to withstand failure
<a name="design-principle-10"></a>

 **How do you design your SAP workload to withstand failure?** Work backwards from your business requirements to define an approach for meeting the availability goals of your SAP infrastructure and data. For each failure scenario, the resiliency requirements, acceptable data loss, and mean time to recovery (MTTR) need to be proportionate to the criticality of the component and the business applications it supports. Select one of the AWS architecture patterns provided for SAP availability, and evaluate it against the criteria you define. These criteria should include the risk and impact of each failure, as well as cost and performance. In all cases, use initial and periodic testing to validate your decisions.

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/latest/sap-lens/design-principle-10.html)

 For more details, see the following information: 
+  AWS Documentation: [Architecture Guidance for Availability and Reliability of SAP on AWS including Failure Scenarios and Architecture Patterns](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html)
+  AWS Documentation: [The Amazon Builders' Library: Static stability using Availability Zones](https://aws.amazon.com/builders-library/static-stability-using-availability-zones/) 
+  AWS Documentation: [Direct Connect Resiliency Recommendations](https://aws.amazon.com/directconnect/resiliency-recommendation/)
+ AWS Documentation: [Disaster Recovery of Workloads on AWS: Recovery in the Cloud](https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/introduction.html)
+  SAP Documentation: [SAP HANA System Architecture Overview](https://help.sap.com/viewer/6b94445c94ae495c83a19646e7c3fd56/LATEST/en-US/1b4477a539ab4b77a3bfe2a6835b5e0e.html) 

# Best Practice 10.1 – Agree on SAP workload availability goals that align with your business requirements
<a name="best-practice-10-1"></a>

Understanding your availability goals is the first step to help ensure that you focus on the factors important to your organization. This helps you to define criteria that can be used to evaluate your architectural patterns.

 **Suggestion 10.1.1 – Identify SAP applications in scope and their interdependencies** 

Identify the SAP applications that you have deployed or will deploy in AWS. Understand any dependencies these applications have regardless of their location.

 **Suggestion 10.1.2 – Classify systems based on the impact of failure** 

 There is no open standard for system classification aligned with planned availability and risk of failure. Defining systems using terms such as Mission Critical or Highly Important can help with defining patterns, identifying application grouping, and justifying costs. Production applications might be impacted differently by an outage. Factors to consider might include: 
+ Revenue generating or revenue reporting
+ External or internal facing
+ Core business vs. technical support
+ Closely coupled vs. loosely coupled with other systems

Non-production environments can also play an important role in indirectly supporting the business. They should also be classified according to project phase and scale, taking into account transport paths (such as business as usual and projects) and supporting role (such as development, unit test, production copy, and training).

 **Suggestion 10.1.3 – Assess the business impact of an outage** 

The impact should be measurable and take into consideration the duration of the outage. Examples of areas of impact include health and safety, financial, legal, regulatory, or brand.

 **Suggestion 10.1.4 – Understand compliance and regulatory requirements** 

Understand compliance or regulatory requirements for data residency and distance between locations to help ensure business continuity.

 **Suggestion 10.1.5 – Define minimum acceptable percentage uptime** 

 For each system, or group of systems, agree on and document an acceptable availability percentage that matches your business requirements. The following terms are used in this context:
+ MTTR – Mean Time to Recovery
+ RTO – Recovery Time Objective
+ RPO – Recovery Point Objective

 A full explanation of the terms can be found in the Well-Architected Framework [Reliability]: [Availability](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/availability.html). Additional information on reliability in SAP can be found in the whitepaper: 
+  AWS Documentation: [Architecture Guidance for Availability and Reliability of SAP on AWS](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html) 
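
When agreeing on an availability percentage, it can help to translate it into a downtime budget that business owners can relate to. The following is a minimal sketch of that arithmetic. The function name and the choice of a 365-day year are assumptions made for illustration, and the percentages shown are examples rather than recommendations.

```python
# Convert an agreed availability percentage into a downtime budget.
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored for simplicity


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) permitted within the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Example targets only; use the values agreed with your business owners.
    for pct in (99.0, 99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

Whether planned maintenance counts against this budget is a decision to document alongside the percentage itself.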

# Best Practice 10.2 – Select an architecture suitable for your availability and capacity requirements
<a name="best-practice-10-2"></a>

There are standard architectural patterns for SAP availability to suit the requirements of most customers deploying SAP on AWS. Use the following suggestions to determine what patterns best meet your availability and capacity requirements. Evaluate the risk and impact of each failure scenario against your business requirements.

 Additional information on availability in SAP can be found in the whitepaper [Architecture Guidance for Availability and Reliability of SAP on AWS](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html). 

 **Suggestion 10.2.1 – Identify all components and AWS services that are required for your SAP system** 

Identify all the required technical components of your SAP system, starting with the core (database, application servers, central services, global file systems) and extending to optional components (for example, Web Dispatchers, SAProuter, Cloud Connector). Determine the required AWS services to support these components and review the shared responsibility model for resiliency.
+ AWS Documentation: [Shared Responsibility Model for Resiliency](https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/shared-responsibility-model-for-resiliency.html)

 **Suggestion 10.2.2 – Use SLAs, durability, availability, and historical data as a guide to the likelihood of failure** 

Likelihood of failure is subjective. Published service level agreements (SLAs) and past performance can only serve as a guide to the risk of future failure. However, the assumed frequency of various scenarios remains a useful data point. A failure that is statistically likely to happen once a year might have a greater influence on design decisions than one that has yet to occur.

 The following information can be used to better understand the services: 
+  [AWS Health Dashboard](https://aws.amazon.com/premiumsupport/technology/aws-health-dashboard/) provides alerts and remediation guidance when AWS is experiencing events that might impact you 
+  [AWS Post-Event Summaries](https://aws.amazon.com/premiumsupport/technology/pes/) are provided for all major service events which impact AWS service availability 
+  [Amazon Compute Service Level Agreement](https://aws.amazon.com/compute/sla/) lists service level agreements (SLAs) for compute services 
+  AWS Documentation: [Amazon EBS Durability and Availability](https://docs.aws.amazon.com/whitepapers/latest/aws-storage-services-overview/durability-and-availability-3.html) 
+  AWS Documentation: [Amazon EFS Data Protection and availability](https://aws.amazon.com/efs/faq/#Data_protection_and_availability) 
+  AWS Documentation: [Direct Connect Resiliency Recommendations](https://aws.amazon.com/directconnect/resiliency-recommendation/?nc=sn&loc=4&dn=2) 

The likelihood of failure of other supporting services should also be evaluated including, but not limited to, domain name services, load balancers, and serverless functions.

 More information can be found in the [Architecture Guidance for Availability and Reliability of SAP on AWS whitepaper](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html). 
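
One way to reason about published availability figures is to combine them for the components an SAP system depends on. The sketch below uses the standard formulas for components in series (all must be available) and for redundant copies (any one is sufficient). It assumes failures are independent, which is a simplification, and the input figures are illustrative placeholders, not published SLAs for any service.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(single, copies):
    """Availability of N independent copies where any one copy is sufficient."""
    return 1 - (1 - single) ** copies


if __name__ == "__main__":
    # Illustrative values only (assumptions, not service commitments).
    database = 0.995
    central_services = 0.995
    app_server = 0.995

    single_node = serial_availability([database, central_services, app_server])
    print(f"All components on single nodes: {single_node:.4%}")

    # Protect each single point of failure with a second, independent copy.
    protected = serial_availability([
        redundant_availability(database, 2),
        redundant_availability(central_services, 2),
        redundant_availability(app_server, 2),
    ])
    print(f"Each component with one redundant copy: {protected:.4%}")
```

The result is a guide to relative risk between designs rather than a prediction, because correlated failures (for example, a shared dependency) are not modeled.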

 **Suggestion 10.2.3 – Assess options for clustering, resilience, and load balancing** 

 An SAP system can be distributed across multiple hosts, with differing mechanisms for ensuring availability. For example, a clustering solution can be used to protect single points of failure (for example, the SAP database and SAP Central Services). The SAP application tier can be scaled horizontally, and load balancing can be used to make the SAP Web Dispatcher highly available.
+  AWS Documentation: [SAP NetWeaver Deployment and Operations Guide for Windows - High Availability System Deployment](https://docs.aws.amazon.com/sap/latest/sap-netweaver/net-win-high-availability-system-deployment.html) 
+  AWS Documentation: [SAP on AWS – IBM Db2 HADR with Pacemaker](https://docs.aws.amazon.com/sap/latest/sap-AnyDB/sap-ibm-pacemaker.html) 
+  AWS Documentation: [SQL Server Deployment for High Availability](https://docs.aws.amazon.com/sap/latest/sap-netweaver/sql-server-deployment-for-high-availability.html) 
+  SAP Documentation: [High Availability Partners](https://wiki.scn.sap.com/wiki/display/SI/High+Availability+Partner+Information) 

 **Suggestion 10.2.4 – Determine the availability of EC2 instance families within Availability Zones** 

 Some Amazon EC2 instance families (for example, `X` and `U`) are not available in all Availability Zones (AZs). Check with your AWS account team or AWS Support to confirm that the EC2 instance families you want to use are available in the intended Availability Zones. Note that an AZ name (for example, `us-east-1a`) can map to a different physical location in each account, whereas an AZ ID (for example, `use1-az1`) refers to the same location in every account. See the AWS documentation for more information.
+  AWS Documentation: [AZ IDs for your AWS resources](https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-ids.html) 
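
As a complement to confirming with your account team, instance type offerings can be queried per AZ ID. The following sketch assumes `boto3` is installed and credentials are configured; it uses the EC2 `describe_instance_type_offerings` call with the `availability-zone-id` location type. The helper names, the example instance type, and the AZ IDs in the usage section are hypothetical choices for illustration. An offering indicates that the type is sold in that zone, not that capacity is available at a given moment.

```python
from collections import defaultdict


def zones_by_instance_type(offerings):
    """Group offering records into {instance_type: sorted list of AZ IDs}."""
    grouped = defaultdict(set)
    for item in offerings:
        grouped[item["InstanceType"]].add(item["Location"])
    return {itype: sorted(zones) for itype, zones in grouped.items()}


def missing_zones(offerings, instance_type, required_zone_ids):
    """Return the required AZ IDs in which the instance type is not offered."""
    available = set(zones_by_instance_type(offerings).get(instance_type, []))
    return sorted(set(required_zone_ids) - available)


def fetch_offerings(region, instance_types):
    """Query EC2 for offerings by AZ ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instance_type_offerings")
    pages = paginator.paginate(
        LocationType="availability-zone-id",
        Filters=[{"Name": "instance-type", "Values": list(instance_types)}],
    )
    return [offer for page in pages for offer in page["InstanceTypeOfferings"]]


if __name__ == "__main__":
    # Hypothetical sample shaped like the API response, used instead of a live call.
    sample = [
        {"InstanceType": "x2idn.16xlarge", "LocationType": "availability-zone-id", "Location": "use1-az1"},
        {"InstanceType": "x2idn.16xlarge", "LocationType": "availability-zone-id", "Location": "use1-az4"},
    ]
    print(missing_zones(sample, "x2idn.16xlarge", ["use1-az1", "use1-az2"]))
```

For a live check, pass the result of `fetch_offerings("us-east-1", ["x2idn.16xlarge"])` to `missing_zones` in place of the sample data.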

 **Suggestion 10.2.5 – Investigate strategies for ensuring capacity** 

The best way to help ensure capacity is to have a similarly sized instance available in case of failure. Other strategies include cloud native options (for example, On-Demand Instances, EC2 instance recovery) or re-allocating shared capacity.

We recommend that you make a capacity commitment in at least two AZs for Amazon EC2 instances that support SAP single points of failure so that the capacity is available at the time you need it.

Amazon EC2 capacity can be reserved using [Zonal Reserved Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/reserved-instances-scope.html) or [On-Demand Capacity Reservations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/capacity-reservations-using.html). Both Zonal Reserved Instances and On-Demand Capacity Reservations can be shared between AWS accounts within the same AWS organization, which allows for the approach of using sacrificial capacity from another environment in the event of significant failure (for example, a complete AZ failure). 

 Further guidance on capacity reservations is available in: 
+  AWS Documentation: [Architecture Guidance for Availability and Reliability of SAP on AWS](https://docs.aws.amazon.com/sap/latest/general/architecture-guidance-of-sap-on-aws.html) 
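
The recommendation to commit capacity in at least two AZs could be scripted along the following lines. This is a sketch under stated assumptions: `boto3` is installed, credentials are configured, and On-Demand Capacity Reservations (rather than Zonal Reserved Instances) are the chosen mechanism. The function names, instance type, platform, and AZ IDs are hypothetical examples. Capacity Reservations are billed whether or not instances run in them, so review the cost before creating any.

```python
def build_reservation_requests(instance_type, platform, count, az_ids):
    """Return one create_capacity_reservation parameter set per AZ ID."""
    if len(set(az_ids)) < 2:
        raise ValueError("reserve capacity in at least two Availability Zones")
    return [
        {
            "InstanceType": instance_type,
            "InstancePlatform": platform,
            "AvailabilityZoneId": az_id,
            "InstanceCount": count,
            # No end date: the reservation stays active until cancelled.
            "EndDateType": "unlimited",
            # Only instances that explicitly target the reservation use it.
            "InstanceMatchCriteria": "targeted",
        }
        for az_id in az_ids
    ]


def create_reservations(region, requests):
    """Create the reservations and return their IDs (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without it

    ec2 = boto3.client("ec2", region_name=region)
    return [
        ec2.create_capacity_reservation(**request)["CapacityReservation"]["CapacityReservationId"]
        for request in requests
    ]


if __name__ == "__main__":
    # Hypothetical values for a database node protected across two AZs.
    for request in build_reservation_requests("r5.4xlarge", "SUSE Linux", 1,
                                              ["use1-az1", "use1-az2"]):
        print(request)
```

Building the requests separately from the API call allows the parameters to be reviewed, or stored in version control, before any capacity is committed.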

 **Suggestion 10.2.6 – Design your VPC across multiple Availability Zones** 

Design your VPC and subnets to ensure that instances can be provisioned in multiple Availability Zones, even if your initial design only relies on one or two AZs. This builds resilience into your design and helps ensure that connectivity and access to services can be confirmed in advance.
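
Planning the address space up front makes it easier to add an Availability Zone later. The sketch below uses only the Python standard library to carve one subnet per AZ out of a VPC CIDR range. The CIDR range, prefix length, and AZ IDs are hypothetical, and a real design would typically allocate several subnets per AZ (for example, separate tiers).

```python
import ipaddress
from itertools import islice


def plan_subnets(vpc_cidr, az_ids, new_prefix):
    """Assign one equally sized subnet per AZ ID from the VPC CIDR range."""
    network = ipaddress.ip_network(vpc_cidr)
    # Take only as many subnets as there are AZs, in address order.
    subnets = list(islice(network.subnets(new_prefix=new_prefix), len(az_ids)))
    if len(subnets) < len(az_ids):
        raise ValueError("VPC range is too small for the requested subnets")
    return dict(zip(az_ids, (str(subnet) for subnet in subnets)))


if __name__ == "__main__":
    # Three AZs are planned even if the initial deployment uses only two.
    plan = plan_subnets("10.0.0.0/16", ["euw1-az1", "euw1-az2", "euw1-az3"], 24)
    for az_id, cidr in plan.items():
        print(f"{az_id}: {cidr}")
```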

# Best Practice 10.3 – Define an approach to help ensure the availability of critical SAP data
<a name="best-practice-10-3"></a>

The business data for an SAP application is primarily stored within the database, but may also include file-based data or binaries (for example, executables, libraries, scripts, configuration, and interface files).

 **Suggestion 10.3.1 – Evaluate MTTR requirements and identify how they can be met** 

 In [Reliability] [Suggestion 10.1.5 – Define minimum acceptable percentage uptime](best-practice-10-1.md), you will have defined the MTTR requirements for each of your applications. Having assessed the risk of failures and the mechanisms for protecting system availability, confirm your requirements can be met, and document the expectations for MTTR against each failure scenario. If compromises need to be made for cost, complexity, or consistency, consult with the business owners to reach an agreement. 
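
Documenting the expected MTTR per failure scenario can be as simple as a small table that is checked against the agreed target. The following sketch shows one possible shape for that check. The scenario names and durations are hypothetical placeholders, not measured or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A failure scenario and the recovery time expected from the design."""
    name: str
    expected_mttr_minutes: float


def scenarios_exceeding(target_minutes, scenarios):
    """Return the names of scenarios whose expected MTTR exceeds the target."""
    return [s.name for s in scenarios if s.expected_mttr_minutes > target_minutes]


if __name__ == "__main__":
    # Hypothetical figures; replace with values validated through testing.
    documented = [
        Scenario("EC2 instance failure (cluster failover)", 10),
        Scenario("Availability Zone failure", 30),
        Scenario("Logical data corruption (restore from backup)", 480),
    ]
    gaps = scenarios_exceeding(target_minutes=240, scenarios=documented)
    print("Scenarios needing agreement with business owners:", gaps)
```

Any scenario returned by the check is a candidate for either a design change or an explicit, documented compromise.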

 **Suggestion 10.3.2 – Determine in which failure scenarios a recovery from backup would be necessary** 

 Backup is often a secondary mechanism for ensuring or recovering availability, but most architectures will have some reliance on backups. The following are examples of failure scenarios that could be used to guide your analysis. The granularity of the scenarios, classification, and impact will vary depending on your requirements and architecture. 

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/latest/sap-lens/best-practice-10-3.html)

 **Suggestion 10.3.3 – Determine where data replication is required** 

Data replication is used to improve reliability by having copies of the same data in multiple locations and is often a requirement for systems with a low RPO. When determining whether replication is required for availability or recovery, consider whether the service is Zonal (for example, Amazon EC2 and Amazon EBS and the databases they support) or Regional (for example, shared storage and Amazon S3).


**Database replication**  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/wellarchitected/latest/sap-lens/best-practice-10-3.html)


**AWS service replication options**  

| Service | Operating level | Replication options available | Guidance | 
| --- | --- | --- | --- | 
| Amazon EFS | File system |  Continuous asynchronous replication within a Region and cross Region  |  [Amazon EFS Replication](https://docs.aws.amazon.com/efs/latest/ug/efs-replication.html)  | 
| Amazon FSx for Windows File Server | File system |  Scheduled asynchronous replication within a Region and cross Region using AWS DataSync  |  [Scheduled replication using AWS DataSync](https://docs.aws.amazon.com/fsx/latest/WindowsGuide/scheduled-replication-datasync.html)  | 
| Amazon FSx for NetApp ONTAP | File system |  Scheduled asynchronous replication within a Region and cross Region using NetApp SnapMirror  |  [Scheduled replication using NetApp SnapMirror](https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/scheduled-replication.html)  | 
| Amazon S3 | S3 bucket | Continuous asynchronous replication within a Region and cross Region |  [Amazon S3 Replicating objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html)  | 
| AWS Elastic Disaster Recovery | EC2 instance | Continuous asynchronous replication within a Region and cross Region |  [AWS Elastic Disaster Recovery](https://docs.aws.amazon.com/drs/latest/userguide/what-is-drs.html)  | 

 **Suggestion 10.3.4 – Build a strategy to ensure consistent configuration data and binaries** 

It is important to have consistent configuration data and binaries to help ensure predictable behavior and a tested setup following a failure. This can include operating system packages, application parameters, and cluster configuration. Determine how you could ensure alignment across all instances for an application, including those which are there for resilience (for example, additional application servers, secondary database nodes).

Amazon EFS, Amazon FSx, and Amazon S3 provide a durable location for shared binaries or configuration that can be managed centrally.

 Refer to [Operational Excellence]: [Best Practice 2.1 - Use version control and configuration management](best-practice-2-1.md) for mechanisms to control versions and manage configuration.

 **Suggestion 10.3.5 – Have a holistic approach to data consistency** 

The approach to ensuring the consistency of critical SAP data should not only focus on a single set of data but also should consider the dependencies within and between datasets and systems. For example, if you need to recover an SAP BW system, but not the source systems it pulls from, what would be the impact on change pointers and what mechanisms are in place to ensure a consistent recovery?

 **Suggestion 10.3.6 – Build a strategy for interfaces that permits data to be replayed or re-sent** 

For data exchange between systems, determine whether the integration is loosely coupled and if data can be replayed or re-sent, either at the source or target. Review if there are queuing capabilities to allow the scenario to be suspended or cached during an outage.
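
Replaying or re-sending data is only safe if the target tolerates receiving the same message more than once. The sketch below illustrates that idea with a generic idempotent receiver keyed on a message ID. It is a conceptual example, not an SAP or AWS API; the class name, message IDs, and payloads are all hypothetical, and a real implementation would persist the set of processed IDs.

```python
class IdempotentReceiver:
    """Applies each message at most once, so a replay cannot duplicate data."""

    def __init__(self):
        self._seen_ids = set()  # in-memory only; persist this in practice
        self.applied = []

    def receive(self, message_id, payload):
        """Apply the payload unless this ID was already processed."""
        if message_id in self._seen_ids:
            return False  # duplicate from a replay; safely ignored
        self._seen_ids.add(message_id)
        self.applied.append(payload)
        return True


if __name__ == "__main__":
    receiver = IdempotentReceiver()
    outbox = [("msg-001", "order 4711"), ("msg-002", "order 4712")]

    # First delivery attempt is interrupted after one message.
    receiver.receive(*outbox[0])

    # After the outage, the source replays the whole outbox.
    for message_id, payload in outbox:
        receiver.receive(message_id, payload)

    print(receiver.applied)
```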

 **Suggestion 10.3.7 – Evaluate the use of a data bunker** 

Failure scenarios that result in the online data becoming unusable or unavailable due to situations such as accidental deletion or a malicious act might require a different approach to help ensure that data is protected or recoverable.

Although prevention is the best defense through a security framework covering network isolation and access control, the impact should be considered in the context of recovery and resilience.

 Using a *write only* backup account with a reduced retention period is a common approach for this rare but potentially high impact scenario. 
+  SAP Lens [Security]: [Best Practice 8.3 - Secure your data recovery mechanisms to protect against threats](best-practice-8-3.md) 

# Best Practice 10.4 – Validate the design against a set of criteria based on your business requirements
<a name="best-practice-10-4"></a>

Establish a set of criteria based on your business requirements, balancing the risk of failure, impact on the business, and acceptable trade-offs. Use these criteria to validate the design and make adjustments where necessary.

 **Suggestion 10.4.1 – Assess the cost to your business of an outage** 

Failures, of either AWS services or SAP components, will impact your SAP system differently depending on the resilience and recovery strategies. The type of failure will determine the duration of the outage (RTO) and the potential data loss (RPO).

For each failure, assess the risk of an outage and the cost to your business. For example, are there revenue generating processes that will be impacted and what is the hourly cost associated with the system not being available?
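
A simple way to compare failure scenarios is to multiply the hourly cost of unavailability by the expected outage duration and by how often the scenario is assumed to occur. The sketch below shows that calculation. The figures are hypothetical, and the model deliberately ignores costs that do not scale with time, such as data loss or reputational damage.

```python
def outage_cost(hourly_cost, outage_hours):
    """Direct cost of a single outage of the given duration."""
    return hourly_cost * outage_hours


def expected_annual_loss(hourly_cost, outage_hours, events_per_year):
    """Cost of one outage weighted by its assumed yearly frequency."""
    return outage_cost(hourly_cost, outage_hours) * events_per_year


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 per hour, 4-hour recovery, once every two years.
    loss = expected_annual_loss(hourly_cost=10_000, outage_hours=4, events_per_year=0.5)
    print(f"Expected annual loss: {loss:,.0f}")
```

Comparing this figure with the yearly cost of the resilience measure that would prevent or shorten the outage gives a first indication of whether the measure is proportionate.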

 **Suggestion 10.4.2 – Assess the cost of your architecture** 

 In SAP landscapes, the largest elements of the AWS monthly bill are typically Amazon EC2 and storage-related services. Understand the cost implications so that you select the best architecture to meet your reliability requirements. Key contributors include:
+ Deployment patterns that don’t maximize hardware utilization
+ Redundant copies of data
+ Operating system license costs
+ Clustering software license costs
+ Costs associated with maintenance, testing, and skilled resources

 Refer to [Cost Optimization]: [Cost optimization Best Practices](cost-optimization.md) for further details. 

 **Suggestion 10.4.3 – Evaluate your design against other pillars in the framework** 

 Reliability cannot be designed in isolation, but should be assessed against the rest of the pillars of the AWS Well-Architected Framework. Example questions you might ask to evaluate this include: 
+ Operational excellence – Do you have the experience and skills to manage the solution?
+ Security – Is your data protected during replication and recovery?
+ Performance – Does replication or backup activity impact user performance?
+ Cost optimization – Does the cost of the solution align with the assumed risk?
+ Sustainability – Does the solution align with your sustainability and environmental impact initiatives?