A kernel panic that was caused by an IBM® General Parallel File System (GPFS™) trigger has occurred on a member host. The trigger repeats on a sporadic but recurring basis.

Symptoms

The output of the db2instance -list command includes a pending failback operation, as shown in the following example:

ID  TYPE   STATE                HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
--  ----   -----                --------- ------------ ----- ---------------- ------------ -------
0   MEMBER WAITING_FOR_FAILBACK hostA     hostB        NO    0                1            hostB-ib0
1   MEMBER STARTED              hostB     hostB        NO    0                0            hostB-ib0
2   MEMBER STARTED              hostC     hostC        NO    0                0            hostC-ib0
128 CF     PRIMARY              hostD     hostD        NO    -                0            hostD-ib0
129 CF     PEER                 hostE     hostE        NO    -                0            hostE-ib0

HOSTNAME STATE    INSTANCE_STOPPED ALERT
-------- -----    ---------------- -----
hostA    INACTIVE NO               YES
hostB    ACTIVE   NO               NO
hostC    ACTIVE   NO               NO
hostD    ACTIVE   NO               NO
hostE    ACTIVE   NO               NO

In the previous example, hostA has a state of INACTIVE, and its ALERT field is marked as YES. The db2instance -list command produces this output when hostA is offline or rebooting. Because hostA, the home host for member 0, is offline, member 0 has failed over to hostB. Member 0 is now waiting to fail back to its home host, as indicated by the WAITING_FOR_FAILBACK state. After hostA is rebooted from the panic, member 0 will fail back to hostA.
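
If you want to check this condition from a script, the following minimal Python sketch parses saved db2instance -list output and reports members that are waiting to fail back, along with hosts that are INACTIVE with an alert. The function names and the SAMPLE_LIST text (abbreviated from the example output) are illustrative assumptions and are not part of any DB2 tooling. The sketch also assumes whitespace-separated columns in the order shown in the example.

```python
def parse_db2instance_list(text):
    """Split db2instance -list output into member/CF rows and host rows."""
    members, hosts = [], []
    section = None
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and separator rows (rows with no letters or digits).
        if not fields or not any(ch.isalnum() for ch in line):
            continue
        if fields[0] == "ID":
            section = "members"
        elif fields[0] == "HOSTNAME":
            section = "hosts"
        elif section == "members" and len(fields) >= 6:
            members.append({
                "id": fields[0], "type": fields[1], "state": fields[2],
                "home_host": fields[3], "current_host": fields[4],
                "alert": fields[5],
            })
        elif section == "hosts" and len(fields) >= 4:
            hosts.append({
                "hostname": fields[0], "state": fields[1],
                "instance_stopped": fields[2], "alert": fields[3],
            })
    return members, hosts


def summarize(text):
    """Return (IDs of members waiting for failback, inactive hosts with an alert)."""
    members, hosts = parse_db2instance_list(text)
    waiting = [m["id"] for m in members if m["state"] == "WAITING_FOR_FAILBACK"]
    down = [h["hostname"] for h in hosts
            if h["state"] == "INACTIVE" and h["alert"] == "YES"]
    return waiting, down


# Hypothetical sample, abbreviated from the example output.
SAMPLE_LIST = """\
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
-- ---- ----- --------- ------------ ----- ---------------- ------------ -------
0 MEMBER WAITING_FOR_FAILBACK hostA hostB NO 0 1 hostB-ib0
1 MEMBER STARTED hostB hostB NO 0 0 hostB-ib0
128 CF PRIMARY hostD hostD NO - 0 hostD-ib0

HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
hostA INACTIVE NO YES
hostB ACTIVE NO NO
"""

waiting_members, inactive_hosts = summarize(SAMPLE_LIST)
```

For the sample text, the sketch flags member 0 and hostA, which matches the situation described for the example output.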

Diagnosis

When you check the db2diag log file, you can find many log entries that indicate that a restart light operation has occurred, as shown in the following example:

2009-08-27-23.37.52.416270-240 I6733A457 LEVEL: Event
PID : 1093874 TID : 1 KTID : 2461779
PROC : db2star2
INSTANCE: NODE : 000
HOSTNAME: hostB
EDUID : 1
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
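
Because these entries can be numerous, it might help to extract them from the log. The following sketch, which is an illustration only, scans db2diag log text for records whose MESSAGE field matches the restart light message in the example and returns the timestamp and host name of each one. It assumes that records are separated by blank lines and that the HOSTNAME field starts at the beginning of a line. The function name and SAMPLE_DIAG are hypothetical.

```python
import re

# Message text taken from the example log entry.
RESTART_LIGHT_MSG = re.compile(
    r"^MESSAGE\s*:\s*Idle process taken over by member", re.M)
# Timestamp at the start of a record, e.g. 2009-08-27-23.37.52.416270-240
TIMESTAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d+)", re.M)
HOSTNAME = re.compile(r"^HOSTNAME:\s*(\S+)", re.M)


def find_restart_light_events(log_text):
    """Return (timestamp, hostname) for each restart light record found."""
    events = []
    # Assumption: db2diag records are separated by one or more blank lines.
    for record in re.split(r"\n\s*\n", log_text):
        if not RESTART_LIGHT_MSG.search(record):
            continue
        ts = TIMESTAMP.search(record)
        host = HOSTNAME.search(record)
        events.append((ts.group(1) if ts else None,
                       host.group(1) if host else None))
    return events


# Hypothetical sample built from the example log entry.
SAMPLE_DIAG = """\
2009-08-27-23.37.52.416270-240 I6733A457 LEVEL: Event
PID : 1093874 TID : 1 KTID : 2461779
PROC : db2star2
INSTANCE: NODE : 000
HOSTNAME: hostB
EDUID : 1
FUNCTION: DB2 UDB, base sys utilities, DB2StartMain, probe:3368
MESSAGE : Idle process taken over by member
DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
996
DATA #2 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
"""

restart_light_events = find_restart_light_events(SAMPLE_DIAG)
```

Many such events that are reported by the same guest host over a short period are consistent with the repeated failovers described in this topic.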

Another way to diagnose this type of problem is to check the system log. Run the OS command errpt -a to view the contents of the AIX® errpt system log. In the AIX errpt log, you might see log entries similar to the following example, which is for hostA:

LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63

Date/Time: Mon May 26 08:02:03 EDT 2008
Sequence Number: 976
Machine Id: 0006DA8AD700
Node Id: hostA
Class: S
Type: TEMP
Resource Name: PANIC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
ASSERT STRING
5.1: xmemout succeeded rc=d

PANIC STRING
kx.C:2024:0:0:04A53FA8::advObjP == ofP->advLkObjP

If you see a KERNEL_PANIC log entry as shown in the previous example, the system reboot might be due to an operating system kernel panic that was triggered by a problem in the GPFS subsystem. A kernel panic and system reboot can result from excessive processor usage or heavy paging on the system, when the GPFS daemons do not receive enough system resources to perform critical tasks. If you experience GPFS file system outages that are related to kernel panics, you must first resolve the underlying processor usage or paging issues. If you cannot resolve the underlying issues, run the db2support command for the database with the -s parameter to collect diagnostic information, and contact IBM Technical Support.
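
To find such entries in a long errpt -a report, a small filter can help. The following Python sketch is an illustration under stated assumptions: each entry begins with a LABEL: line, and the field names match those in the example entry. The function name, the SAMPLE_ERRPT text, and the OTHER_EVENT label are hypothetical.

```python
import re


def find_kernel_panics(errpt_text):
    """Return a dict of key fields for each KERNEL_PANIC entry in errpt -a text."""
    panics = []
    # Assumption: every entry starts with a line that begins with "LABEL:".
    for entry in re.split(r"(?m)^(?=LABEL:)", errpt_text):
        label = re.search(r"(?m)^LABEL:[ \t]*(\S+)", entry)
        if not label or label.group(1) != "KERNEL_PANIC":
            continue
        when = re.search(r"(?m)^Date/Time:[ \t]*(.+?)[ \t]*$", entry)
        node = re.search(r"(?m)^Node Id:[ \t]*(\S+)", entry)
        # The panic string is on the line that follows the PANIC STRING heading.
        panic = re.search(r"(?m)^PANIC STRING[ \t]*\n[ \t]*(.+?)[ \t]*$", entry)
        panics.append({
            "date_time": when.group(1) if when else None,
            "node": node.group(1) if node else None,
            "panic_string": panic.group(1) if panic else None,
        })
    return panics


# Hypothetical sample: the first entry is abbreviated from the example,
# and the second entry uses a made-up label to show that it is skipped.
SAMPLE_ERRPT = """\
LABEL: KERNEL_PANIC
IDENTIFIER: 225E3B63

Date/Time: Mon May 26 08:02:03 EDT 2008
Node Id: hostA
Resource Name: PANIC

Detail Data
PANIC STRING
kx.C:2024:0:0:04A53FA8::advObjP == ofP->advLkObjP
LABEL: OTHER_EVENT
Date/Time: Mon May 26 09:00:00 EDT 2008
Node Id: hostA
"""

kernel_panics = find_kernel_panics(SAMPLE_ERRPT)
```

Comparing the date and time of each reported panic with the restart light events in the db2diag log file can help confirm that the failovers coincide with the kernel panics.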