

# Operations
<a name="sap-hana-pacemaker-rhel-operations"></a>

**Topics**
+ [Viewing the cluster state](sap-hana-pacemaker-rhel-ops-cluster-state.md)
+ [Performing planned maintenance](sap-hana-pacemaker-rhel-ops-planned-maint.md)
+ [Post-failure analysis and reset](sap-hana-pacemaker-rhel-ops-post-failure.md)
+ [Alerting and monitoring](sap-hana-pacemaker-rhel-ops-alert-monitor.md)

# Viewing the cluster state
<a name="sap-hana-pacemaker-rhel-ops-cluster-state"></a>

**Topics**
+ [Operating system based](#_operating_system_based)

## Operating system based
<a name="_operating_system_based"></a>

Several operating system commands can be run as root, or as a user with appropriate permissions, to get an overview of the status of the cluster and its services.

```
# pcs status --full
```

Note: Omit the `--full` option for more concise output if you do not need to view the node attributes.

Sample output:

```
Cluster name: hacluster
Cluster Summary:
  * Stack: corosync
  * Current DC: hanahost02 (version 2.1.2-4.el9_0.5-ada5c3b36e2) - partition with quorum
  * Last updated: Tue Jun  3 15:47:15 2025
  * Last change:  Tue Jun  3 15:47:12 2025 by hacluster via crmd on hanahost02
  * 2 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ hanahost01 hanahost02 ]

Full List of Resources:
  * rsc_fence_aws       (stonith:fence_aws):     Started hanahost01
  * rsc_ip_HDB_HDB00    (ocf:heartbeat:aws-vpc-move-ip):         Started hanahost01
  * Clone Set: rsc_SAPHanaTopology_HDB_HDB00-clone [rsc_SAPHanaTopology_HDB_HDB00]:
    * Started: [ hanahost01 hanahost02 ]
  * Clone Set: rsc_SAPHana_HDB_HDB00-clone [rsc_SAPHana_HDB_HDB00] (promotable):
    * Promoted: [ hanahost01 ]
    * Unpromoted: [ hanahost02 ]

Node Attributes:
  * Node: hanahost01 (1):
    * hana_hdb_clone_state              : PROMOTED
    * hana_hdb_op_mode                  : logreplay
    * hana_hdb_remoteHost               : hanavirt02
    * hana_hdb_roles                    : 4:P:master1:master:worker:master
    * hana_hdb_site                     : siteA
    * hana_hdb_srah                     : -
    * hana_hdb_srmode                   : syncmem
    * hana_hdb_sync_state               : PRIM
    * hana_hdb_version                  : 2.00.073.00
    * hana_hdb_vhost                    : hanavirt01
    * lpa_hdb_lpt                       : 1755493611
    * master-rsc_SAPHana_HDB_HDB00      : 150
  * Node: hanahost02 (2):
    * hana_hdb_clone_state              : DEMOTED
    * hana_hdb_op_mode                  : logreplay
    * hana_hdb_remoteHost               : hanavirt01
    * hana_hdb_roles                    : 4:S:master1:master:worker:master
    * hana_hdb_site                     : siteB
    * hana_hdb_srah                     : -
    * hana_hdb_srmode                   : syncmem
    * hana_hdb_sync_state               : SOK
    * hana_hdb_version                  : 2.00.073.00
    * hana_hdb_vhost                    : hanavirt02
    * lpa_hdb_lpt                       : 30
    * master-rsc_SAPHana_HDB_HDB00      : 100

Migration Summary:

Tickets:

PCSD Status:
  hanahost01: Online
  hanahost02: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
```
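
For scripted checks, you can filter the full status output for specific node attributes. For example, to view only the system replication and clone states (assuming the `hdb` SID from the sample output above):

```
# pcs status --full | grep -E "hana_hdb_(sync_state|clone_state)"
```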

The following table provides a list of useful commands.


| Command | Description | 
| --- | --- | 
|   `crm_mon`   |  Display cluster status on the console with updates as they occur  | 
|   `crm_mon -1`   |  Display cluster status on the console just once, and exit  | 
|   `crm_mon -Arnf`   |  Display node attributes (`-A`), group resources by node (`-n`), display inactive resources (`-r`), and display resource fail counts (`-f`)  | 
|   `pcs help`   |  View more options for `pcs`  | 
|   `crm_mon --help-all`   |  View more options  | 
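
For example, to take a single snapshot of the cluster state that groups resources by node and includes node attributes, inactive resources, and fail counts, combine the options from the table above:

```
# crm_mon -1 -Arnf
```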

# Performing planned maintenance
<a name="sap-hana-pacemaker-rhel-ops-planned-maint"></a>

When performing maintenance on SAP HANA systems in a cluster environment, it’s important to understand how the cluster interacts with SAP HANA system replication. Planned maintenance activities should be conducted carefully to prevent unnecessary failovers or cluster interventions.

There are different options to perform planned maintenance on nodes, resources, and the cluster.

**Topics**
+ [Maintenance mode](#_maintenance_mode)
+ [Placing a node in standby mode](#_placing_a_node_in_standby_mode)
+ [Moving a resource](#_moving_a_resource)

## Maintenance mode
<a name="_maintenance_mode"></a>

Use maintenance mode if you want to make any changes to the configuration or take control of the resources and nodes in the cluster. In most cases, this is the safest option for administrative tasks.

**Example**  
Use the following command to turn on maintenance mode.  

```
# pcs property set maintenance-mode=true
```
Use the following command to turn off maintenance mode.  

```
# pcs property set maintenance-mode=false
```
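
The following sketch shows a typical workflow: enable maintenance mode, confirm that the resources are shown as unmanaged, perform the administrative tasks, and then return control to the cluster. On older pcs versions, `pcs property config` is `pcs property show`.

```
# pcs property set maintenance-mode=true
# pcs status

... perform administrative tasks ...

# pcs property set maintenance-mode=false
# pcs property config maintenance-mode
```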

## Placing a node in standby mode
<a name="_placing_a_node_in_standby_mode"></a>

To perform maintenance on the cluster without a full system outage, the recommended method for moving active resources is to place the node you want to remove from the cluster in standby mode.

```
# pcs node standby <hostname>
```

The cluster will cleanly relocate resources, and you can perform maintenance activities, including reboots, on the node in standby mode. When maintenance activities are complete, you can re-introduce the node with the following command.

```
# pcs node unstandby <hostname>
```
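
For example, a complete maintenance cycle on `hanahost01` (using the hostnames from the sample output in this section) can look like the following sketch. Check `pcs status` after each step to confirm that resources have relocated and, at the end, that the node has rejoined the cluster.

```
# pcs node standby hanahost01
# pcs status

... reboot or patch hanahost01 ...

# pcs node unstandby hanahost01
# pcs status
```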

## Moving a resource
<a name="_moving_a_resource"></a>

When moving individual resources, be sure you understand resource dependencies and constraints. The following command demonstrates how to force an SAP HANA takeover by moving the promotable clone resource. Always review the cluster status afterwards and verify that no temporary location constraints remain.

For example:

```
# pcs resource move rsc_SAPHana_HDB_HDB00-clone hanahost02
Location constraint to move resource 'rsc_SAPHana_HDB_HDB00-clone' has been created
Waiting for the cluster to apply configuration changes...
Location constraint created to move resource 'rsc_SAPHana_HDB_HDB00-clone' has been removed
Waiting for the cluster to apply configuration changes...
resource 'rsc_SAPHana_HDB_HDB00-clone' is promoted on node 'hanahost02'; unpromoted on node 'hanahost01'
```

Note: The exact resource name will vary depending on your SAP HANA system ID and instance number. Adjust the commands accordingly.
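
To verify that the move did not leave a temporary location constraint behind, and to remove one if it did, you can use the following commands. The sketch assumes the resource name from the example above; on older pcs versions, `pcs constraint location config` is `pcs constraint location`.

```
# pcs constraint location config
# pcs resource clear rsc_SAPHana_HDB_HDB00-clone
```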

# Post-failure analysis and reset
<a name="sap-hana-pacemaker-rhel-ops-post-failure"></a>

A review must be conducted after each failure to understand the source of the failure as well as the reaction of the cluster. In most scenarios, the cluster prevents an application outage. However, a manual action is often required to reset the cluster to a protective state for any subsequent failures.

**Topics**
+ [Checking the Logs](#_checking_the_logs)
+ [Cleanup pcs status](#_cleanup_pcs_status)
+ [Restart failed nodes or pacemaker](#_restart_failed_nodes_or_pacemaker)
+ [Further Analysis](#_further_analysis)

## Checking the Logs
<a name="_checking_the_logs"></a>
+ For troubleshooting cluster issues, use `journalctl` to examine both the pacemaker and corosync logs:

  ```
  # journalctl -u pacemaker -u corosync --since "1 hour ago"
  ```
  + Use `--since` to specify time periods (e.g., "2 hours ago", "today")
  + Add `-f` to follow logs in real-time
  + Combine with grep for specific searches
+ System messages and resource agent activity can be found in `/var/log/messages`.
+ For HANA-specific issues, check the HANA trace directory. This can be reached using the `cdtrace` alias when logged in as `<sid>adm`. Also consult the `DB_<tenantdb>` directory within the HANA trace directory. See the example after this list.
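
For example, the following sketch searches the cluster logs for error and fencing events in an incident window, and then inspects the most recent HANA trace files. It assumes an SID of `HDB`, so the admin user is `hdbadm`.

```
# journalctl -u pacemaker -u corosync --since "2 hours ago" | grep -iE "error|fenc"
# su - hdbadm
hdbadm> cdtrace
hdbadm> ls -lt | head
```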

## Cleanup pcs status
<a name="_cleanup_pcs_status"></a>

If failed actions are reported using the `pcs status` command, and if they have already been investigated, then you can clear the reports with the following command.

```
# pcs resource cleanup <resource> node=<hostname>
```
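
For example, to clear an investigated failure of the SAP HANA resource on `hanahost01` and confirm that the report is gone:

```
# pcs resource cleanup rsc_SAPHana_HDB_HDB00-clone node=hanahost01
# pcs status
```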

## Restart failed nodes or pacemaker
<a name="_restart_failed_nodes_or_pacemaker"></a>

It is recommended that failed (or fenced) nodes are not restarted automatically. This gives operators a chance to investigate the failure, and ensures that the cluster doesn't make assumptions about the state of resources.

Depending on your approach, you need to manually restart the instance or start the pacemaker service after the analysis is complete.
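
For example, if the cluster services are not enabled at boot (as shown under Daemon Status in the sample output earlier), a fenced node stays out of the cluster after its reboot until you start pacemaker manually. A sketch of re-introducing the node once the analysis is complete:

```
# systemctl status pacemaker
# pcs cluster start
# pcs status --full
```

Run these commands on the affected node, and verify that the node rejoins and that system replication returns to `SOK`.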

## Further Analysis
<a name="_further_analysis"></a>

For cluster-specific issues, use `pcs cluster report` to generate a targeted analysis of cluster components across all nodes:

```
# pcs cluster report --from="YYYY-MM-DD HH:MM:SS" --to="YYYY-MM-DD HH:MM:SS" /tmp/cluster-report
```

**Using pcs cluster report**
+ Specify a time range that encompasses the incident
+ The report includes logs and configuration from all nodes
+ Review the generated tarball for cluster events, resource operations, and configuration changes
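
For example, to collect data for a two-hour incident window and inspect the result (crm_report, which `pcs cluster report` calls, typically writes the destination as a compressed tarball):

```
# pcs cluster report --from="2025-06-03 14:00:00" --to="2025-06-03 16:00:00" /tmp/cluster-report
# tar -xf /tmp/cluster-report.tar.bz2 -C /tmp
```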

# Alerting and monitoring
<a name="sap-hana-pacemaker-rhel-ops-alert-monitor"></a>

This section covers the following topics.

**Topics**
+ [Using Amazon CloudWatch Application Insights](#_using_amazon_cloudwatch_application_insights)
+ [Using the cluster alert agents](#_using_the_cluster_alert_agents)

## Using Amazon CloudWatch Application Insights
<a name="_using_amazon_cloudwatch_application_insights"></a>

For monitoring and visibility of cluster state and actions, Application Insights includes metrics for monitoring enqueue replication state, cluster metrics, and SAP and high availability checks. Additional metrics, such as EFS and CPU monitoring, can also help with root cause analysis.

For more information, see [Get started with Amazon CloudWatch Application Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/appinsights-getting-started.html) and [SAP HANA High Availability on Amazon EC2](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/component-configuration-examples-hana-ha.html).

## Using the cluster alert agents
<a name="_using_the_cluster_alert_agents"></a>

Within the cluster configuration, you can call an external program (an alert agent) to handle alerts. This is a *push* notification. It passes information about the event via environment variables.

The agents can then be configured to send emails, log to a file, update a monitoring system, and so on. For example, the following script can be used to publish to Amazon SNS.

```
#!/bin/sh

# alert_sns.sh
# modified from /usr/share/pacemaker/alerts/alert_smtp.sh.sample

##############################################################################
# SETUP
# * Create an SNS Topic and subscribe email or chatbot
# * Note down the ARN for the SNS topic
# * Give the IAM Role attached to both Instances permission to publish to the SNS Topic
# * Ensure the aws cli is installed
# * Copy this file to /usr/share/pacemaker/alerts/alert_sns.sh or other location on BOTH nodes
# * Ensure the permissions allow for hacluster and root to execute the script
# * Run the following as root (modify file location if necessary and replace SNS ARN):
#
# SLES:
# crm configure alert aws_sns_alert /usr/share/pacemaker/alerts/alert_sns.sh meta timeout=30s timestamp-format="%Y-%m-%d_%H:%M:%S" to { arn:aws:sns:region:account-id:myPacemakerAlerts }
#
# RHEL:
# pcs alert create id=aws_sns_alert path=/usr/share/pacemaker/alerts/alert_sns.sh meta timeout=30s timestamp-format="%Y-%m-%d_%H:%M:%S"
# pcs alert recipient add aws_sns_alert value=arn:aws:sns:region:account-id:myPacemakerAlerts
##############################################################################

# Additional information to send with the alerts
node_name=$(uname -n)
# The cluster name is referenced in the alert subjects below
cluster_name=$(crm_attribute --query --name cluster-name --quiet)
sns_body=$(env | grep CRM_alert_)

# Required for SNS
TOKEN=$(/usr/bin/curl --noproxy '*' -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Get metadata
REGION=$(/usr/bin/curl --noproxy '*' -w "\n" -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document | grep region | awk -F\" '{print $4}')

sns_subscription_arn=${CRM_alert_recipient}

# Format depending on alert type
case ${CRM_alert_kind} in
   node)
     sns_subject="${CRM_alert_timestamp} ${cluster_name}: Node '${CRM_alert_node}' is now '${CRM_alert_desc}'"
   ;;
   fencing)
     sns_subject="${CRM_alert_timestamp} ${cluster_name}: Fencing ${CRM_alert_desc}"
   ;;
   resource)
     if [ "${CRM_alert_interval}" = "0" ]; then
         CRM_alert_interval=""
     else
         CRM_alert_interval=" (${CRM_alert_interval})"
     fi
     if [ "${CRM_alert_target_rc}" = "0" ]; then
         CRM_alert_target_rc=""
     else
         CRM_alert_target_rc=" (target: ${CRM_alert_target_rc})"
     fi
     case ${CRM_alert_desc} in
         Cancelled)
           ;;
         *)
           sns_subject="${CRM_alert_timestamp}: Resource operation '${CRM_alert_task}${CRM_alert_interval}' for '${CRM_alert_rsc}' on '${CRM_alert_node}': ${CRM_alert_desc}${CRM_alert_target_rc}"
           ;;
     esac
     ;;
   attribute)
     sns_subject="${CRM_alert_timestamp}: The '${CRM_alert_attribute_name}' attribute of the '${CRM_alert_node}' node was updated in '${CRM_alert_attribute_value}'"
     ;;
   *)
     sns_subject="${CRM_alert_timestamp}: Unhandled $CRM_alert_kind alert"
     ;;
esac

# Publish the alert to the SNS topic
aws sns publish --topic-arn "${sns_subscription_arn}" --subject "${sns_subject}" --message "${sns_body}" --region "${REGION}"
```
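
After creating the alert and recipient, you can review the configuration with `pcs alert config`. To test the script itself, invoke it manually with the environment variables that Pacemaker would otherwise set; the values below are illustrative.

```
# pcs alert config
# CRM_alert_kind=node CRM_alert_timestamp="2025-06-03_15:47:15" \
  CRM_alert_node=hanahost01 CRM_alert_desc=member \
  CRM_alert_recipient=arn:aws:sns:region:account-id:myPacemakerAlerts \
  /usr/share/pacemaker/alerts/alert_sns.sh
```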