A Case of Frequent RAC Node Restarts
This case comes from a reader whose RAC kept restarting: since September one of the nodes had been rebooting roughly every two days. Let's start with the logs.
Node 1's alert log:
Tue Oct 28 10:51:40 2014
Thread 1 advanced to log sequence 22792 (LGWR switch)
  Current log# 107 seq# 22792 mem# 0: +ORADATA/portaldb/redo107.log
Tue Oct 28 10:57:16 2014
Thread 1 advanced to log sequence 22793 (LGWR switch)
  Current log# 108 seq# 22793 mem# 0: +ORADATA/portaldb/redo108.log
Tue Oct 28 11:04:07 2014
Reconfiguration started (old inc 48, new inc 50)
List of instances:
 1 (myinst: 1)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Oct 28 11:04:09 2014
Tue Oct 28 11:04:09 2014
Next, let's look at the CRS alert log on node 1:
2014-10-23 07:19:57.145
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.732 seconds
2014-10-23 07:20:05.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.708 seconds
2014-10-23 07:20:09.175
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.702 seconds
2014-10-23 07:20:11.880
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564863; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-23 09:58:11.620
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.141 seconds
2014-10-23 09:58:18.634
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.126 seconds
2014-10-23 09:58:23.660
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.100 seconds
2014-10-23 09:58:25.763
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564865; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-23 14:31:07.140
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.105 seconds
2014-10-23 14:31:14.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.075 seconds
2014-10-23 14:31:19.181
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.063 seconds
2014-10-23 14:31:21.246
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564867; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-25 06:02:39.191
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.748 seconds
2014-10-25 06:02:47.197
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.742 seconds
2014-10-25 06:02:51.203
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.736 seconds
2014-10-25 06:02:53.941
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564869; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-25 06:04:02.765
[crsd(6815946)]CRS-2772:Server 'xxdb2' has been assigned to pool 'ora.portaldb'.
2014-10-28 11:03:48.965
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.023 seconds
2014-10-28 11:03:55.990
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.998 seconds
2014-10-28 11:04:00.008
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.979 seconds
2014-10-28 11:04:02.988
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564871; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
2014-10-28 11:04:05.992
[cssd(6095264)]CRS-1625:Node xxdb2, number 2, was manually shut down
2014-10-28 11:04:05.998
........
[cssd(6095264)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxdb1 xxdb2 .
From node 1's CRS alert log we can see that node evictions occurred on the 23rd, the 25th and the 28th. Judging from these messages alone, the problem appears to be network related.
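The CRS-1612 / CRS-1611 / CRS-1610 / CRS-1607 sequence above is simply the CSS misscount countdown: warnings when the network heartbeat has been missing for 50%, 75% and 90% of the timeout interval, followed by the eviction itself. Below is a minimal sketch of how the "Removal of this node from cluster in N seconds" figures relate to the time since the last network heartbeat, assuming the default misscount of 30 seconds (the actual value can be confirmed with crsctl get css misscount); this is an illustration only, not Oracle's code.

# Sketch of the CSS misscount countdown behind the CRS-1612/1611/1610 warnings.
# Assumption: misscount = 30 s (the usual 11.2 default for non-vendor clusterware);
# verify the real value with "crsctl get css misscount".
MISSCOUNT = 30.0  # seconds without a network heartbeat before eviction

def removal_countdown(seconds_since_last_nhb):
    """Return (seconds left before eviction, warning level) for a peer node."""
    remaining = round(MISSCOUNT - seconds_since_last_nhb, 3)
    missed = seconds_since_last_nhb / MISSCOUNT
    if missed >= 0.90:
        level = "CRS-1610 (90%)"
    elif missed >= 0.75:
        level = "CRS-1611 (75%)"
    elif missed >= 0.50:
        level = "CRS-1612 (50%)"
    else:
        level = "ok"
    return remaining, level

# Example: the 50% warning on the 28th says "removal in 14.023 seconds",
# i.e. roughly 16 s had already passed since the last network heartbeat.
print(removal_countdown(15.977))   # -> (14.023, 'CRS-1612 (50%)')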
Node 1's ocssd.log is as follows:
2014-10-28 11:03:47.010: [    CSSD][1029]clssscSelect: cookie accept request 110987400
2014-10-28 11:03:47.010: [    CSSD][1029]clssgmAllocProc: (113c7b590) allocated
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: properties of cmProc 113c7b590 - 0,1,2,3,4
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: Connect from con(1d287ee) proc(113c7b590) pid(26345486) version 11:2:1:4, properties: 0,1,2,3,4
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: msg flags 0x0000
2014-10-28 11:03:47.061: [    CSSD][2577]clssnmSetupReadLease: status 1
........
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 50% heartbeat fatal, removal in 14.023 seconds
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) is impending reconfig, flag 2294796, misstime 15977
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-10-28 11:03:48.965: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917536, LATS 2041808730, lastSeqNo 41914523, uniqueness 1414188209, timestamp 1414465442/2128363897
......
2014-10-28 11:03:49.611: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917537, LATS 2041809376, lastSeqNo 41917536, uniqueness 1414188209, timestamp 1414465443/2128364399
2014-10-28 11:03:49.612: [    CSSD][2577]clssnmSetupReadLease: status 1
2014-10-28 11:03:49.617: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
2014-10-28 11:03:49.617: [    CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
2014-10-28 11:03:49.619: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (113c9aad0), operation status 1, client status 0
2014-10-28 11:03:49.633: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 18 with status 1
2014-10-28 11:03:49.633: [    CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
2014-10-28 11:03:49.635: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 18 for proc (113c9aad0), operation status 1, client status 0
2014-10-28 11:03:49.671: [    CSSD][1029]clssnmGetNodeNumber: xxdb1
2014-10-28 11:03:49.725: [    CSSD][1029]clssnmGetNodeNumber: xxdb2
.......
2014-10-28 11:03:49.969: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917538, LATS 2041809734, lastSeqNo 41917537, uniqueness 1414188209, timestamp 1414465443/2128364901
.......
2014-10-28 11:03:50.970: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917540, LATS 2041810735, lastSeqNo 41917538, uniqueness 1414188209, timestamp 1414465444/2128365902
.......
2014-10-28 11:03:51.248: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2014-10-28 11:03:51.248: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:03:51.975: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917542, LATS 2041811740, lastSeqNo 41917540, uniqueness 1414188209, timestamp 1414465445/2128366906
.......
2014-10-28 11:04:00.007: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917558, LATS 2041819771, lastSeqNo 41917556, uniqueness 1414188209, timestamp 1414465453/2128374949
2014-10-28 11:04:00.008: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 90% heartbeat fatal, removal in 2.979 seconds, seedhbimpd 1
2014-10-28 11:04:01.010: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917560, LATS 2041820775, lastSeqNo 41917558, uniqueness 1414188209, timestamp 1414465454/2128375951
2014-10-28 11:04:02.012: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917562, LATS 2041821776, lastSeqNo 41917560, uniqueness 1414188209, timestamp 1414465455/2128376952
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmPollingThread: Removal started for node xxdb2 (2), flags 0x23040c, state 3, wt4c 0
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmMarkNodeForRemoval: node 2, xxdb2 marked for removal
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmDiscHelper: xxdb2, node(2) connection failed, endp (c1da79), probe(0), ninf->endp c1da79
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmDiscHelper: node 2 clean up, endp (c1da79), init state 5, cur state 5
2014-10-28 11:04:02.988: [GIPCXCPT][3605] gipcInternalDissociate: obj 11299f410 [0000000000c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2014-10-28 11:04:02.988: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 11299f410 [0000000000c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 265564871
2014-10-28 11:04:02.988: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 1, from -1, changes 61
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 265564871
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564871), indicating EXADATA fence initialization complete
........
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmNeedConfReq: No configuration to change
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, xxdb2, misstime(30001) state(5)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2014-10-28 11:04:02.988: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 1, events 0xa0, state 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckSplit: Node 2, xxdb2, is alive, DHB (1414465455, 2128376952) more than disk timeout of 27000 after the last NHB (1414465426, 2128347958)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 1
2014-10-28 11:04:02.988: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 0, events 0x20, state 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmRemove: Start
2014-10-28 11:04:02.988: [    CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, xxdb2, from the cluster in incarnation 265564871, node birth incarnation 265564870, death incarnation 265564871, stateflags 0x234000 uniqueness value 1414188209
........
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmSendShutdown: req to node 2, kill time 2041822753
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmsendmsg: not connected to node 2
We can see that on the 28th it was the clssnmPollingThread that raised the alarm. Oracle essentially relies on this CSSD thread to decide whether the heartbeats between cluster nodes are still healthy.
From the messages above, the indication is that node 2 had lost its network heartbeat even though its disk heartbeat was still being written.
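The repeated "has a disk HB, but no network HB" messages are the crux: node 2 is still writing its disk heartbeat to the voting disk, so the node itself is alive, but the network heartbeat over the private interconnect has stopped arriving. The following is a rough, purely illustrative model of that classification (my own simplification, not Oracle's implementation); the 30 s / 27 s thresholds mirror the misscount and the shortened disk timeout visible in the log.

import time

# Illustrative model of how a CSSD-style monitor could classify a peer node
# from its two heartbeats. Thresholds mirror the values seen in the log:
# 30 s misscount for the network heartbeat, 27 s short disk timeout during
# an impending reconfiguration.
NETWORK_HB_TIMEOUT = 30.0
DISK_HB_TIMEOUT = 27.0

def classify_peer(now, last_network_hb, last_disk_hb):
    net_ok = (now - last_network_hb) < NETWORK_HB_TIMEOUT
    disk_ok = (now - last_disk_hb) < DISK_HB_TIMEOUT
    if net_ok and disk_ok:
        return "healthy"
    if disk_ok:
        # The situation in this case: the peer is alive and writing to the
        # voting disk, but the interconnect is not delivering heartbeats,
        # so split-brain resolution has to decide who survives.
        return "disk HB but no network HB (interconnect problem)"
    if net_ok:
        return "network HB but no disk HB (voting disk / storage problem)"
    return "no heartbeats (node most likely down)"

now = time.time()
print(classify_peer(now, last_network_hb=now - 16.0, last_disk_hb=now - 1.0))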
The focus now turns to node 2's logs. First, node 2's alert log:
Tue Oct 28 10:59:40 2014
Thread 2 advanced to log sequence 26516 (LGWR switch)
  Current log# 208 seq# 26516 mem# 0: +ORADATA/portaldb/redo208.log
Tue Oct 28 11:04:18 2014
NOTE: ASMB terminating
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_23724130.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_23724130.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
ASMB (ospid: 23724130): terminating the instance due to error 15064
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (22806846) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (16187426) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (28049650) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
ORA-1092 : opitsk aborting process
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (7864418) as a result of ORA-1092
Instance terminated by ASMB, pid = 23724130
Tue Oct 28 11:05:45 2014
Starting ORACLE instance (normal)
........
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /oracle/product/11.2.0
System name: AIX
Node name: xxdb2
Release: 1
Version: 7
The database instance's alert log offers little of value, so let's move on to the most critical log, node 2's ocssd.log:
2014-10-28 11:03:58.792: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 50% heartbeat fatal, removal in 14.773 seconds
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) is impending reconfig, flag 2491406, misstime 15227
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-10-28 11:04:01.217: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932275, LATS 2128362398, lastSeqNo 42873784, uniqueness 1414000248, timestamp 1414465427/2041806941
........
2014-10-28 11:04:13.242: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932297, LATS 2128374423, lastSeqNo 42932295, uniqueness 1414000248, timestamp 1414465439/2041818993
2014-10-28 11:04:13.243: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 90% heartbeat fatal, removal in 2.746 seconds, seedhbimpd 1
2014-10-28 11:04:14.244: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932299, LATS 2128375425, lastSeqNo 42932297, uniqueness 1414000248, timestamp 1414465440/2041819996
2014-10-28 11:04:14.843: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2014-10-28 11:04:14.843: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:04:15.246: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932301, LATS 2128376426, lastSeqNo 42932299, uniqueness 1414000248, timestamp 1414465441/2041821006
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmPollingThread: Removal started for node xxdb1 (1), flags 0x26040e, state 3, wt4c 0
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmMarkNodeForRemoval: node 1, xxdb1 marked for removal
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmDiscHelper: xxdb1, node(1) connection failed, endp (90e), probe(0), ninf->endp 90e
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmDiscHelper: node 1 clean up, endp (90e), init state 5, cur state 5
2014-10-28 11:04:15.994: [GIPCXCPT][3605] gipcInternalDissociate: obj 111261650 [000000000000090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2014-10-28 11:04:15.994: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 111261650 [000000000000090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 }, flags 0x0
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 265564871
2014-10-28 11:04:15.994: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 2, from -1, changes 4
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 265564871
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE
2014-10-28 11:04:15.994: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932302, LATS 2128377175, lastSeqNo 42932301, uniqueness 1414000248, timestamp 1414465441/2041821506
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564871), indicating EXADATA fence initialization complete
.......
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, xxdb1, misstime(30004) state(5)
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (1414465441, 2041821506) more than disk timeout of 27000 after the last NHB (1414465412, 2041791963)
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(ocr_xxdb-scan) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0x78, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119] ----- Call Stack Trace -----
........
2014-10-28 11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: CmInfo State val 2, changes 13
2014-10-28 11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes val 265564870, changes 5
2014-10-28 11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(265564870)
2014-10-28 11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(265564870)
2014-10-28 11:04:15.996: [    CSSD][1]clssgmStartNMMon: completed node cleanup
2014-10-28 11:04:15.996: [    CSSD][4119]calling call entry argument values in hex
2014-10-28 11:04:15.996: [    CSSD][4119]location type point (? means dubious value)
2014-10-28 11:04:15.996: [    CSSD][3348]clssgmUpdateEventValue: HoldRequest val 1, changes 3
2014-10-28 11:04:15.996: [    CSSD][4119]-------------------- -------- -------------------- ----------------------------
2014-10-28 11:04:15.997: [    CSSD][4119]clssscExit()+708 bl 10006db74 67800000678 ? 1000F33C8 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 11003ADF0 ? 90000000085C914 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmCheckDskInfo( call clssscExit() 111B08E30 ? 200000002 ?
2014-10-28 11:04:15.997: [    CSSD][4119])+1600 11003ADF0 ? 90000000085C914 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmDoSyncUpdate( call clssnmCheckDskInfo( 111B08E30 ? 110E3F130 ?
2014-10-28 11:04:15.997: [    CSSD][4119])+4016 ) 000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000004 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmRcfgMgrThread call clssnmDoSyncUpdate( 111B08E30 ? 110E3F130 ?
2014-10-28 11:04:15.997: [    CSSD][4119]()+2992 ) 000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 1128045D0 ? 100000001 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssscthrdmain()+20 call clssnmRcfgMgrThread 111B08E30 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]8 () 000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]_pthread_body()+240 call clssscthrdmain() 111B08E30 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]fffffffffffffffc call _pthread_body() 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]
.......
----- End of Call Stack Trace -----
2014-10-28 11:04:16.052: [    CSSD][4119]clssnmSendMeltdownStatus: node xxdb2, number 2, has experienced a failure in thread number 3 and is shutting down
2014-10-28 11:04:16.052: [    CSSD][4119]clssscExit: Starting CRSD cleanup
2014-10-28 11:04:16.052: [    CSSD][2320]clssnmvDiskKillCheck: not evicted, file /dev/rhdisk22 flags 0x00000000, kill block unique 0, my unique 1414188209
.......
2014-10-28 11:04:16.995: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932304, LATS 2128378175, lastSeqNo 42932302, uniqueness 1414000248, timestamp 1414465442/2041822515
2014-10-28 11:04:17.004: [    CSSD][3605]clssnmPollingThread: state(3) clusterState(2) exit
2014-10-28 11:04:17.004: [    CSSD][3605]clssscExit: abort already set 1
2014-10-28 11:04:17.053: [    CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync 265564871, stamp 2041822753
2014-10-28 11:04:17.053: [    CSSD][2320]clssscExit: abort already set 1
........
2014-10-28 11:04:17.995: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932306, LATS 2128379175, lastSeqNo 42932304, uniqueness 1414000248, timestamp 1414465443/2041823528
........
2014-10-28 11:04:18.600: [    CSSD][4119]clssscExit: CRSD cleanup successfully completed
2014-10-28 11:04:18.602: [ default][4119]kgzf_gen_node_reid2: generated reid cid=2cca484453d04f96bf8ea33b240c4c97,icin=265564851,nmn=2,lnid=265564870,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
2014-10-28 11:04:18.602: [    CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
2014-10-28 11:04:18.602: [    CSSD][4119]clssgmUpdateEventValue: CmInfo State val 0, changes 14
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking RPC Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking dead client Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking dead proc Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: Aborting client (1114fb0f0) proc (111dceb10), iocapables 1.
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: I/O capable proc (111dceb10), pid (19595302), iocapables 1, client (1114fb0f0)
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: Aborting client (11152a530) proc (111d29b50), iocapables 2.
Filtering node 2's ocssd.log down to the key entries, we find the following.
On the 25th:
2014-10-25 06:03:07.804: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564869, wrtcnt, 42656043, LATS 1851108984, lastSeqNo 42656042, uniqueness 1414000248, timestamp 1414188173/1764553533
.......
2014-10-25 06:03:07.804: [    CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync 265564869, stamp 1764553706
2014-10-25 06:03:07.804: [    CSSD][2320]###################################
2014-10-25 06:03:07.804: [    CSSD][4376]clssnmHandleSync: Node xxdb2, number 2, is EXADATA fence capable
2014-10-25 06:03:07.804: [    CSSD][2320]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2014-10-25 06:03:07.804: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564869)
2014-10-25 06:03:07.804: [    CSSD][2320]###################################
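So on the 25th, node 2 was not the one that decided to die: it was killed through the voting disk. Node 1 wrote a kill block for node 2, and node 2's clssnmvKillBlockThread noticed it during its periodic clssnmvDiskKillCheck and aborted CSSD (the CSSNM00005 message). A toy sketch of that kill-block idea follows, with a local JSON file standing in for the shared voting disk; the file name and record layout are purely made up for illustration.

import json, os, sys, time

# Toy illustration of voting-disk based eviction ("kill block"): the surviving
# node writes a kill record for the victim, and the victim's periodic
# kill-block check reads it and aborts. A local file stands in for the shared
# voting disk; the real on-disk format is nothing like this.
VOTEDISK = "/tmp/fake_votedisk_killblock.json"   # hypothetical stand-in

def write_kill_block(evictor_node, victim_node, sync_no):
    """Run on the surviving node: mark the victim for eviction."""
    with open(VOTEDISK, "w") as f:
        json.dump({"victim": victim_node, "evictor": evictor_node,
                   "sync": sync_no, "stamp": time.time()}, f)

def disk_kill_check(my_node):
    """Run periodically on every node: abort if another node evicted us."""
    if not os.path.exists(VOTEDISK):
        return
    with open(VOTEDISK) as f:
        kb = json.load(f)
    if kb.get("victim") == my_node:
        print("Aborting, evicted by node %s, sync %s" % (kb["evictor"], kb["sync"]))
        sys.exit(1)

write_kill_block(evictor_node=1, victim_node=2, sync_no=265564869)
disk_kill_check(my_node=2)   # node 2 sees the kill block and aborts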
On the 28th:
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (1414465441, 2041821506) more than disk timeout of 27000 after the last NHB (1414465412, 2041791963)
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
Looking purely at the data from the 25th and the 28th, these two failures are actually different. From the key entries above we can see that on the 25th CSSD aborted from the clssnmvKillBlockThread thread, while on the 28th it aborted from clssnmRcfgMgrThread.
Clearly these two threads do completely different jobs: the first handles voting disk (kill block) operations, while the second is network related, driving cluster reconfiguration after the network heartbeat is lost.
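On the 28th, the decision came from clssnmCheckDskInfo's split-brain resolution: with the interconnect down, each node formed its own one-node cohort, exchanged that information through the voting disk, and node 2 aborted itself because its cohort lost (CSSNM00008). My understanding of the usual CSS rule is that the larger cohort survives and, when the cohorts are the same size, the cohort containing the lowest node number wins, which is why node 1 survived here. A simplified sketch of that survivor rule, based on my reading of the behavior rather than Oracle's code:

# Simplified split-brain survivor selection, as suggested by the CSSNM00008
# message: the larger cohort survives; on a tie, the cohort containing the
# lowest node number wins (my reading of the typical pre-12.2 CSS behavior).
def surviving_cohort(cohort_a, cohort_b):
    """Each cohort is a set of node numbers that can still talk to each other."""
    if len(cohort_a) != len(cohort_b):
        return cohort_a if len(cohort_a) > len(cohort_b) else cohort_b
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

# The 28th: node 1 and node 2 each formed a one-node cohort, so the tie-break
# favors the cohort containing node 1, and node 2 aborts to avoid split brain.
print(surviving_cohort({1}, {2}))   # -> {1}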
Finally, I asked the reader what had changed recently; reportedly the problem started right after a network switch was replaced.
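Since every eviction here follows a loss of the network heartbeat over the private interconnect, one simple way to corroborate the switch suspicion is to monitor the interconnect continuously from each node and look for gaps approaching the misscount. A rough sketch is below; the peer address is a placeholder rather than a value from this case, and the ping flags are Linux style and may need adjusting on AIX.

import subprocess
import time
from datetime import datetime

# Rough interconnect monitor: ping the peer's private IP once a second and
# log any outage long enough to threaten the 30 s misscount.
PEER_PRIVATE_IP = "192.168.100.2"   # hypothetical private interconnect address

def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo succeeds (Linux-style ping flags)."""
    rc = subprocess.call(["ping", "-c", "1", "-w", str(timeout_s), host],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return rc == 0

last_ok = time.time()
while True:
    if ping_once(PEER_PRIVATE_IP):
        last_ok = time.time()
    else:
        gap = time.time() - last_ok
        print("%s interconnect ping lost for %.1f s" % (datetime.now(), gap))
    time.sleep(1)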
Sharing this small case with everyone!