

# Operations
<a name="rhel-ase-ha-operations"></a>

This section covers the following topics.

**Topics**
+ [Analysis and maintenance](rhel-ase-ha-operations-topics.md)
+ [Testing](rhel-ase-testing.md)

# Analysis and maintenance
<a name="rhel-ase-ha-operations-topics"></a>

This section covers the following topics.

**Topics**
+ [Viewing the cluster state](#clsuter-state)
+ [Performing planned maintenance](#planned-maintenance)
+ [Post-failure analysis and reset](#analysis-reset)
+ [Alerting and monitoring](#alerting-monitoring)

## Viewing the cluster state
<a name="clsuter-state"></a>

You can view the state of the cluster based on your operating system.

 **Operating system based** 

There are multiple operating system commands that can be run as root, or as a user with the appropriate permissions, to get an overview of the status of the cluster and its services. See the following examples for more details.

```
pcs status
```

Sample output:

```
pcs status
Cluster name: rhelha
Cluster Summary:
* Stack: corosync
* Current DC: <rhxdbhost01> (version 2.1.2-4.el8_6.5-ada5c3b36e2) - partition with quorum
* Last updated: Wed Apr 12 19:38:46 2023
* Last change: Mon Apr 10 14:55:08 2023 by root via crm_resource on <rhxdbhost01>
* 2 nodes configured
* 10 resource instances configured
Node List:
* Online: [ awnulaeddb awnulaeddbha ]
Full List of Resources:
* rsc_aws_stonith_ARD (stonith:fence_aws): Started awnulaeddb
* Resource Group: grp_ARD_ASEDB:
* rsc_ip_ARD_ASEDB (ocf::heartbeat:aws-vpc-move-ip): Started <rhxdbhost01>
* rsc_fs_ARD_sybase (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_fs_ARD_data (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_fs_ARD_log (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_fs_ARD_sapdiag (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_fs_ARD_saptmp (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_fs_ARD_backup (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_fs_ARD_usrsap (ocf::heartbeat:Filesystem): Started <rhxdbhost01>
* rsc_ase_ARD_ASEDB (ocf::heartbeat:SAPDatabase): Started <rhxdbhost01>
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
```
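For scripted health checks, the same output can be parsed for resources that are not running. The following sketch uses a captured sample in a here-document; on a real cluster node you would replace it with `status=$(pcs status)`.

```shell
# Sketch: count resources that are not in the "Started" state. The here-document
# stands in for real output; on a cluster node use: status=$(pcs status)
status=$(cat <<'EOF'
 * rsc_ip_ARD_ASEDB (ocf::heartbeat:aws-vpc-move-ip): Started rhxdbhost01
 * rsc_fs_ARD_sybase (ocf::heartbeat:Filesystem): Stopped
EOF
)
not_started=$(printf '%s\n' "$status" | grep -c 'Stopped')
echo "resources stopped: $not_started"
```

A non-zero count can then be used to raise an alert from a cron job or monitoring agent.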

The following table provides a list of useful commands.


| Command | Description | 
| --- | --- | 
|   `crm_mon`   |  Display cluster status on the console with updates as they occur  | 
|   `crm_mon -1`   |  Display cluster status on the console just once, and exit  | 
|   `crm_mon -Arnf`   |  `-A`: display node attributes; `-n`: group resources by node; `-r`: display inactive resources; `-f`: display resource fail counts  | 
|   `pcs help`   |  View more options  | 
|   `crm_mon --help-all`   |  View more options  | 

## Performing planned maintenance
<a name="planned-maintenance"></a>

The cluster connector is designed to integrate the cluster with the SAP start framework (`sapstartsrv`), including rolling kernel switch (RKS) awareness. Stopping and starting the SAP system with `sapcontrol` should not result in any cluster remediation activities, as these actions are not interpreted as failures. Validate this scenario when testing your cluster.
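A validation along these lines can be scripted. This is a sketch only; the instance number (`03`) is an assumption and must be adjusted to your system, and the commands must be run on a cluster node as a suitably privileged user.

```shell
# Sketch: stop and restart SAP with sapcontrol and confirm that the cluster
# records no failed actions. Instance number 03 is an assumed placeholder.
if command -v sapcontrol >/dev/null 2>&1; then
  sapcontrol -nr 03 -function StopSystem
  sapcontrol -nr 03 -function WaitforStopped 600 10
  # The stop must not be treated as a failure by the cluster:
  if pcs status | grep -qi 'failed'; then
    result="failed actions present - investigate before continuing"
  else
    result="no failed actions reported"
  fi
  sapcontrol -nr 03 -function StartSystem
else
  result="sapcontrol not available on this host"
fi
echo "$result"
```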

There are different options to perform planned maintenance on nodes, resources, and the cluster.

**Topics**
+ [Maintenance mode](#maintenance-mode)
+ [Placing a node in standby mode](#node-standby)
+ [Moving a resource (not recommended)](#moving-resource)

### Maintenance mode
<a name="maintenance-mode"></a>

Use maintenance mode if you want to make any changes to the configuration or take control of the resources and nodes in the cluster. In most cases, this is the safest option for administrative tasks.

**Example**  
Use the following command to turn on maintenance mode.  

```
pcs property set maintenance-mode="true"
```

Use the following command to turn off maintenance mode.

```
pcs property set maintenance-mode="false"
```

### Placing a node in standby mode
<a name="node-standby"></a>

To perform maintenance on the cluster without a system outage, the recommended method for moving active resources is to place the node that you want to remove from the cluster in standby mode.

```
pcs node standby <rhxdbhost01>
```

The cluster will cleanly relocate resources, and you can perform activities, including reboots on the node in standby mode. When maintenance activities are complete, you can re-introduce the node with the following command.

```
pcs node unstandby <rhxdbhost01>
```
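The full maintenance cycle can be sketched as follows. The hostname is the placeholder used in the commands above; verify relocation before rebooting the standby node.

```shell
# Sketch of a node maintenance cycle (placeholder hostname, as above).
node=rhxdbhost01
if command -v pcs >/dev/null 2>&1; then
  pcs node standby "$node"
  pcs status                # confirm resources have relocated to the peer node
  # ... perform maintenance, including reboots, on the standby node ...
  pcs node unstandby "$node"
  result="maintenance cycle complete"
else
  result="pcs not installed on this host"
fi
echo "$result"
```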

### Moving a resource (not recommended)
<a name="moving-resource"></a>

Moving individual resources is not recommended because of the migration or move constraints that are created to lock the resource in its new location. These constraints can be cleared as described in the informational messages, but doing so introduces an additional administrative step.

```
<rhxdbhost01>:~ # pcs resource move <grp_ARD_ASEDB>

Note: Move constraint created for <grp_ARD_ASEDB> to <rhxdbhost02>
Note: Use "pcs constraint location remove cli-prefer-grp_ARD_ASEDB" to remove this constraint.
```
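If you do move a resource, the move constraint must be cleared afterwards. A sketch, using the resource group name from the example:

```shell
# Sketch: clear the move constraint so the cluster is again free to place the
# resource. pcs resource clear removes the cli-prefer location constraint.
if command -v pcs >/dev/null 2>&1; then
  pcs resource clear grp_ARD_ASEDB
  result="move constraint cleared"
else
  result="pcs not installed on this host"
fi
echo "$result"
```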

## Post-failure analysis and reset
<a name="analysis-reset"></a>

A review must be conducted after each failure to understand the source of the failure as well as the reaction of the cluster. In most scenarios, the cluster prevents an application outage. However, a manual action is often required to reset the cluster to a protective state for any subsequent failures.

**Topics**
+ [Checking the logs](#checking-logs)
+ [Cleanup `pcs status`](#cleanup-pcs)
+ [Restart failed nodes or `pacemaker`](#restart-nodes)
+ [Further analysis](#further-analysis)

### Checking the logs
<a name="checking-logs"></a>

Start your troubleshooting by checking the operating system log `/var/log/messages`. You can find additional information in the cluster and pacemaker logs.
+  **Cluster logs** – the log destination is defined in the `corosync.conf` file located at `/etc/corosync/corosync.conf`.
+  **Pacemaker logs** – written to the `pacemaker.log` file located at `/var/log/pacemaker`.
+  **Resource agents** – `/var/log/messages` 

Application-based failures can be investigated in the SAP work directory.

### Cleanup `pcs status`
<a name="cleanup-pcs"></a>

If failed actions are reported by the `pcs status` command, and they have already been investigated, then you can clear the reports with the following commands.

```
pcs resource cleanup <resource> <hostname>
```

```
pcs stonith cleanup
```

**Note**  
Use the help command to understand the impact of these commands.

### Restart failed nodes or `pacemaker`
<a name="restart-nodes"></a>

It is recommended that failed (or fenced) nodes are not automatically restarted. This gives operators a chance to investigate the failure, and ensures that the cluster doesn't make assumptions about the state of resources.

You need to restart the instance or the pacemaker service based on your approach.

### Further analysis
<a name="further-analysis"></a>

If further analysis from Red Hat is required, they may request an sos report, or cluster logs collected with `crm_report` or `pcs cluster report`.

 **sos report** – The `sos report` command is a tool that collects configuration details, system information, and diagnostic information from a Red Hat Enterprise Linux system, for example, the running kernel version, loaded modules, and system and service configuration files. The command also runs external programs to collect further information, and stores this output in the resulting archive. For more information, see the Red Hat documentation [What is an sos report and is it different from an sosreport?](https://access.redhat.com/solutions/3592#sos_report) 

 **crm report** – The `crm_report` command collects cluster logs and information from the node where the command is run. For more information, see the Red Hat documentation [How do I generate a crm_report from a RHEL 6 or 7 High Availability cluster node using pacemaker?](https://access.redhat.com/solutions/787853) 

```
crm_report
```

 **pcs cluster report** – The `pcs cluster report` command collects cluster logs and information from all of the nodes in the cluster.

```
pcs cluster report <destination_path>
```

**Note**  
The `pcs cluster report` command relies on passwordless ssh being set up between the nodes.
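You can confirm this prerequisite before collecting the report. A sketch, where the peer hostname is a placeholder for your secondary node:

```shell
# Sketch: verify passwordless ssh to the peer node before running
# pcs cluster report. BatchMode prevents an interactive password prompt.
peer=rhxdbhost02
if ssh -o BatchMode=yes -o ConnectTimeout=5 "$peer" true 2>/dev/null; then
  result="passwordless ssh to $peer is working"
else
  result="passwordless ssh to $peer is not working"
fi
echo "$result"
```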

## Alerting and monitoring
<a name="alerting-monitoring"></a>

 **Using the cluster alert agents** 

Within the cluster configuration, you can call an external program (an alert agent) to handle alerts. This is a *push* notification. It passes information about the event via environment variables.

The agents can then be configured to send emails, log to a file, update a monitoring system, etc. For example, the following script can be used to access Amazon SNS.

```
#!/bin/sh

#alert_sns.sh
#modified from /usr/share/pacemaker/alerts/alert_smtp.sh.sample

##############################################################################
#SETUP
# * Create an SNS Topic and subscribe email or chatbot
# * Note down the ARN for the SNS topic
# * Give the IAM Role attached to both Instances permission to publish to the SNS Topic
# * Ensure the aws cli is installed
# * Copy this file to /usr/share/pacemaker/alerts/alert_sns.sh or other location on BOTH nodes
# * Ensure the permissions allow for hacluster and root to execute the script
# * Run the following as root (modify file location if necessary and replace SNS ARN):
#
# SLES:
# crm configure alert aws_sns_alert /usr/share/pacemaker/alerts/alert_sns.sh meta timeout=30s timestamp-format="%Y-%m-%d_%H:%M:%S" to { <arn:aws:sns:region:account-id:myPacemakerAlerts> }
#
# RHEL:
# pcs alert create id=aws_sns_alert path=/usr/share/pacemaker/alerts/alert_sns.sh meta timeout=30s timestamp-format="%Y-%m-%d_%H:%M:%S"
# pcs alert recipient add aws_sns_alert value=<arn:aws:sns:region:account-id:myPacemakerAlerts>

#Additional information to send with the alerts.
node_name=`uname -n`
sns_body=`env | grep CRM_alert_`

#Required for SNS
TOKEN=$(/usr/bin/curl --noproxy '*' -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

#Get metadata
REGION=$(/usr/bin/curl --noproxy '*' -w "\n" -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document | grep region | awk -F\" '{print $4}')

sns_subscription_arn=${CRM_alert_recipient}

#Format depending on alert type
case ${CRM_alert_kind} in
   node)
     sns_subject="${CRM_alert_timestamp} ${cluster_name}: Node '${CRM_alert_node}' is now '${CRM_alert_desc}'"
   ;;
   fencing)
     sns_subject="${CRM_alert_timestamp} ${cluster_name}: Fencing ${CRM_alert_desc}"
   ;;
   resource)
     if [ ${CRM_alert_interval} = "0" ]; then
         CRM_alert_interval=""
     else
         CRM_alert_interval=" (${CRM_alert_interval})"
     fi
     if [ ${CRM_alert_target_rc} = "0" ]; then
         CRM_alert_target_rc=""
     else
         CRM_alert_target_rc=" (target: ${CRM_alert_target_rc})"
     fi
     case ${CRM_alert_desc} in
         Cancelled)
           ;;
         *)
           sns_subject="${CRM_alert_timestamp}: Resource operation '${CRM_alert_task}${CRM_alert_interval}' for '${CRM_alert_rsc}' on '${CRM_alert_node}': ${CRM_alert_desc}${CRM_alert_target_rc}"
           ;;
     esac
     ;;
   attribute)
     sns_subject="${CRM_alert_timestamp}: The '${CRM_alert_attribute_name}' attribute of the '${CRM_alert_node}' node was updated in '${CRM_alert_attribute_value}'"
     ;;
   *)
     sns_subject="${CRM_alert_timestamp}: Unhandled $CRM_alert_kind alert"
     ;;
esac

#Use this information to send the email.
aws sns publish --topic-arn "${sns_subscription_arn}" --subject "${sns_subject}" --message "${sns_body}" --region ${REGION}
```

# Testing
<a name="rhel-ase-testing"></a>

We recommend scheduling regular fault scenario recovery testing at least annually, and as part of the operating system or SAP kernel updates that may impact operations. For more details on best practices for regular testing, see SAP Lens – [Best Practice 4.3 – Regularly test business continuity plans and fault recovery](https://docs.aws.amazon.com/wellarchitected/latest/sap-lens/best-practice-4-3.html).

The tests described here simulate failures. These can help you understand the behavior and operational requirements of your cluster.

In addition to checking the state of cluster resources, ensure that the service you are trying to protect is in the required state. Can you still connect to SAP? Are locks still available in SM12?

Define the recovery time to ensure that it aligns with your business objectives. Record recovery actions in runbooks.

**Topics**
+ [Test 1: Stop SAP ASE database using `sapcontrol`](#test1)
+ [Test 2: Unmount FSx for ONTAP file system on primary host](#test2)
+ [Test 3: Kill the database processes on the primary host](#test3)
+ [Test 4: Simulate hardware failure of an individual node](#test4)
+ [Test 5: Simulate a network failure](#test5)
+ [Test 6: Simulate an NFS failure](#test6)
+ [Test 7: Accidental shutdown](#test7)

## Test 1: Stop SAP ASE database using `sapcontrol`
<a name="test1"></a>

 **Simulate failure** – On rhxdbhost01 as root:

```
/usr/sap/hostctrl/exe/saphostctrl -function StopDatabase -dbname ARD -dbtype syb -force
```

 **Expected behavior** – SAP ASE database is stopped, and the `SAPDatabase` resource agent enters a failed state. The cluster will failover the database to the secondary instance.

 **Recovery action** – No action required.

## Test 2: Unmount FSx for ONTAP file system on primary host
<a name="test2"></a>

 **Simulate failure** – On rhxdbhost01 as root:

```
umount -l /sybase/ARD/sapdata_1
```

 **Expected behavior** – The `rsc_fs` resource enters a failed state. The cluster stops the SAP ASE database, and will failover to the secondary instance.

 **Recovery action** – No action required.

## Test 3: Kill the database processes on the primary host
<a name="test3"></a>

 **Simulate failure** – On rhxdbhost01 as root:

```
ps -ef | grep -i sybaard
kill -9 <PID>
```

 **Expected behavior** – SAP ASE database fails, and the `SAPDatabase` resource enters a failed state. The cluster will failover the database to the secondary instance.

 **Recovery action** – No action required.

## Test 4: Simulate hardware failure of an individual node
<a name="test4"></a>

 **Notes** – To simulate a system crash, you must first ensure that `/proc/sys/kernel/sysrq` is set to 1.
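The check described above can be sketched as follows; writing the value requires root, and the standard procfs path is assumed.

```shell
# Sketch: ensure the magic SysRq trigger is enabled before the crash test.
# Writing requires root; /proc/sys/kernel/sysrq is the standard procfs path.
if [ -w /proc/sys/kernel/sysrq ]; then
  echo 1 > /proc/sys/kernel/sysrq
fi
sysrq_value=$(cat /proc/sys/kernel/sysrq 2>/dev/null || echo unavailable)
echo "kernel.sysrq is: $sysrq_value"
```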

 **Simulate failure** – On the primary host as root:

```
echo 'c' > /proc/sysrq-trigger
```

 **Expected behavior** – The node which has been killed fails. The cluster moves the resource (SAP ASE database) that was running on the failed node to the surviving node.

 **Recovery action** – Start the EC2 node.

## Test 5: Simulate a network failure
<a name="test5"></a>

 **Notes** – See the following list.
+ `iptables` must be installed.
+ Use a subnet in this command because of the secondary ring.
+ Check for any existing `iptables` rules as `iptables -F` will flush all rules.
+ Review the `pcmk_delay` and `priority` parameters if you see that neither node survives the fence race.

 **Simulate failure** – On either node as root:

```
iptables -A INPUT -s <CIDR_of_other_subnet> -j DROP; iptables -A OUTPUT -d <CIDR_of_other_subnet> -j DROP
```

 **Expected behavior** – The cluster detects the network failure, and fences one of the nodes to avoid a split-brain situation.

 **Recovery action** – If the node where the command was run survives, execute `iptables -F` to clear the network failure. Start the EC2 node.

## Test 6: Simulate an NFS failure
<a name="test6"></a>

 **Notes** – See the following list.
+ `iptables` must be installed.
+ Check for any existing `iptables` rules as `iptables -F` will flush all rules.
+ Although rare, this is an important scenario to test. Depending on the activity, it may take some time (10 minutes or more) to notice that I/O to EFS is not occurring and fail either the Filesystem or SAP resources.

 **Simulate failure** – On the primary host as root:

```
iptables -A OUTPUT -p tcp --dport 2049 -m state --state NEW,ESTABLISHED,RELATED -j DROP; iptables -A INPUT -p tcp --sport 2049 -m state --state ESTABLISHED -j DROP
```

 **Expected behavior** – The cluster detects that NFS is not available, and the `SAPDatabase` resource agent fails, and moves to the FAILED state.

 **Recovery action** – If the node where the command was run survives, execute `iptables -F` to clear the network failure. Start the EC2 node.

## Test 7: Accidental shutdown
<a name="test7"></a>

 **Notes** – See the following list.
+ Avoid shutdowns without cluster awareness.
+ We recommend the use of systemd to ensure predictable behavior.
+ Ensure the resource dependencies are in place.

 **Simulate failure** – Login to AWS Management Console, and stop the instance or issue a shutdown command.

 **Expected behavior** – The node that has been shut down fails. The cluster moves the resource (SAP ASE database) that was running on the failed node to the surviving node. If systemd and resource dependencies are not configured, the cluster detects an unclean stop of cluster services while the EC2 instance is shutting down gracefully, and fences the instance.

 **Recovery action** – Start the EC2 node and pacemaker service.