Problem(Abstract)

This document describes several ways to determine whether RSCT has rebooted your node.

Symptom

The node reboots without an operator issuing the reboot command.

Resolving the problem

TSAMP cannot reboot your node; the core TSAMP application has no built-in functionality for this. However, RSCT (Reliable Scalable Cluster Technology), the cluster provider that TSAMP ‘rides’ upon, can and will reboot your node in a few different situations. This technote does not discuss why the node was rebooted, only some of the ways to determine whether RSCT initiated the reboot.

Syslogs:

The easiest way to see that RSCT has rebooted a node is to check your syslogs for the following messages:
Jan 11 10:15:38 node03 ConfigRM[1418]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PeerDomain.C,1.99.25.16,18148            :::CONFIGRM_NOQUORUM_ER#012The operational quorum state of the active peer domain has changed to NO_QUORUM. #012This indicates that recovery of cluster resources can no longer occur and that #012the node may be rebooted or halted in order to ensure that critical resources #012are released so that they can be recovered by another sub-domain that may have #012operational quorum.

Jan 11 10:15:38 node03 ConfigRM[1418]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID:  :::Template ID: 0:::Details File:  :::Location: RSCT,PeerDomain.C,1.99.25.16,21028            :::CONFIGRM_REBOOTOS_ER#012The operating system is being rebooted to ensure that critical resources are #012stopped so that another sub-domain that has operational quorum may recover #012these resources without causing corruption or conflict.

The quickest way to find these entries is to grep your syslog output file for “REBOOTOS”.
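The search can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the function name find_rsct_reboot_msgs is hypothetical, and the default path /var/log/messages is an assumption, since the syslog output file depends on your platform and syslog configuration. It matches both message labels shown above.

```shell
# Hypothetical helper: print the RSCT quorum-loss and reboot messages
# found in a syslog output file.
# Assumption: /var/log/messages is only a placeholder default; pass the
# file that your syslog daemon actually writes to.
find_rsct_reboot_msgs() {
    rsct_logfile="${1:-/var/log/messages}"
    # Match the two ConfigRM labels shown in the sample messages.
    grep -E 'CONFIGRM_(NOQUORUM|REBOOTOS)_ER' "$rsct_logfile"
}
```

For example, find_rsct_reboot_msgs /path/to/your/syslog.out prints every matching line; no output means neither label was logged in that file.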

ConfigRM Trace File:
Your IBM.ConfigRM trace file can be found in the following directory and must be formatted before it can be read:
/var/ct/IW/log/mc/IBM.ConfigRM/

To format the file:
rpttr -odtic /var/ct/IW/log/mc/IBM.ConfigRM/trace > /tmp/IBM.ConfigRM_Trace.out
** Note: Your specific file names for the trace file may differ from what is shown above **

The formatted output contains entries like the following:
01/11/2011 10:15:38 AM.886775 T(11844464) _CFD !!!!!!!!!!!!!!!!! PeerDomainRcp::haltOS Entered. !!!!!!!!!!!!!!!!!!!!!
01/11/2011 10:15:38 AM.886856 T(11844464) _CFD logerr: In file=/project/spreljan/build/rjans002a/src/rsct/rm/ConfigRM/PeerDomain.C (Version=1.99.25.16 Line=21028) :
CONFIGRM_REBOOTOS_ER
The operating system is being rebooted to ensure that critical resources are
stopped so that another sub-domain that has operational quorum may recover
these resources without causing corruption or conflict.

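After running the above rpttr command to format the traces, search the output file for “rebootos”. Use a case-insensitive search (for example, grep -i), because the label appears in upper case in the trace.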
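A minimal sketch of that search follows. The function name find_trace_reboot is hypothetical, and the default file name assumes the trace was formatted to /tmp/IBM.ConfigRM_Trace.out as in the rpttr command above. It also matches the haltOS entry that precedes the reboot message in the sample trace.

```shell
# Hypothetical helper: show reboot evidence in an already formatted
# IBM.ConfigRM trace file.
# Assumption: the default path is the rpttr output file used earlier.
find_trace_reboot() {
    rsct_tracefile="${1:-/tmp/IBM.ConfigRM_Trace.out}"
    # -i: ignore case, because the trace prints CONFIGRM_REBOOTOS_ER in upper case
    # -n: print line numbers so the surrounding trace entries are easy to find
    grep -in -e 'rebootos' -e 'haltos' "$rsct_tracefile"
}
```

Running find_trace_reboot with no argument searches the default output file; the line numbers it prints can be used to open the trace at the right place and read the entries logged just before the reboot.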

Error Report (AIX only):
Create or view the error report with the following command:
errpt -a > /tmp/error_report.out

Search for the following message:
-----------------------------------------------------------------------
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63

Date/Time:       Fri Sep  9 17:35:14 2011
Sequence Number: 163592
Machine Id:      00C54C6E4C00
Node Id:         ccdev32
Class:           S
Type:            TEMP
WPAR:            Global
Resource Name:   PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING

PANIC STRING
RSCT reboot caused by critical resource protection
-----------------------------------------------------------------------

On both AIX 5.3 and AIX 6.1 the “IDENTIFIER” value is the same, so searching the error report for “225E3B63” will locate and identify the RSCT reboot.
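The identifier search can be sketched as follows. This is an illustrative helper under stated assumptions, not part of AIX or RSCT: the function name rsct_panic_entries is hypothetical, the default path is the report file written by the errpt command above, and the parsing assumes the detailed report layout shown in the sample entry (an IDENTIFIER line, a Date/Time line, and the panic text on the line after PANIC STRING).

```shell
# Hypothetical helper: list the time stamp and panic string of every
# error report entry whose identifier is 225E3B63.
# Assumption: the input is the saved output of "errpt -a".
rsct_panic_entries() {
    rsct_report="${1:-/tmp/error_report.out}"
    awk '
        # Start of the identifying fields: remember whether this entry matches.
        /^IDENTIFIER:/  { match_id = ($2 == "225E3B63") }
        # Keep the time stamp of a matching entry.
        /^Date\/Time:/  { if (match_id) when = $0 }
        # The panic text is on the line that follows this label.
        /^PANIC STRING/ { if (match_id) want = 1; next }
        want            { print when " | " $0; want = 0; match_id = 0 }
    ' "$rsct_report"
}
```

Each output line pairs a Date/Time field with its panic string, so an entry reading “RSCT reboot caused by critical resource protection” confirms the reboot and shows when it happened. On the node itself, errpt -a -j 225E3B63 should limit the report to this identifier before it is saved.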

Event Viewer (Windows only):
Search the Event Viewer for “0xDEADDEAD” to find RSCT-initiated reboots.