A Case of Frequent RAC Node Restarts
This case comes from a reader whose RAC kept restarting: since September one of the nodes had been rebooting roughly every two days. Let's start with the logs.
Node 1's alert log:
Tue Oct 28 10:51:40 2014
Thread 1 advanced to log sequence 22792 (LGWR switch)
  Current log# 107 seq# 22792 mem# 0: +ORADATA/portaldb/redo107.log
Tue Oct 28 10:57:16 2014
Thread 1 advanced to log sequence 22793 (LGWR switch)
  Current log# 108 seq# 22793 mem# 0: +ORADATA/portaldb/redo108.log
Tue Oct 28 11:04:07 2014
Reconfiguration started (old inc 48, new inc 50)
List of instances:
 1 (myinst: 1)
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Oct 28 11:04:09 2014
Tue Oct 28 11:04:09 2014
Next, let's look at the CRS alert log on node 1:
2014-10-23 07:19:57.145
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.732 seconds
2014-10-23 07:20:05.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.708 seconds
2014-10-23 07:20:09.175
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.702 seconds
2014-10-23 07:20:11.880
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564863; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-23 09:58:11.620
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.141 seconds
2014-10-23 09:58:18.634
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.126 seconds
2014-10-23 09:58:23.660
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.100 seconds
2014-10-23 09:58:25.763
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564865; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-23 14:31:07.140
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.105 seconds
2014-10-23 14:31:14.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.075 seconds
2014-10-23 14:31:19.181
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.063 seconds
2014-10-23 14:31:21.246
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564867; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-25 06:02:39.191
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.748 seconds
2014-10-25 06:02:47.197
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.742 seconds
2014-10-25 06:02:51.203
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.736 seconds
2014-10-25 06:02:53.941
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564869; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
........
2014-10-25 06:04:02.765
[crsd(6815946)]CRS-2772:Server 'xxdb2' has been assigned to pool 'ora.portaldb'.
2014-10-28 11:03:48.965
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.023 seconds
2014-10-28 11:03:55.990
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval. Removal of this node from cluster in 6.998 seconds
2014-10-28 11:04:00.008
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.979 seconds
2014-10-28 11:04:02.988
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation 265564871; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
2014-10-28 11:04:05.992
[cssd(6095264)]CRS-1625:Node xxdb2, number 2, was manually shut down
2014-10-28 11:04:05.998
........
[cssd(6095264)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxdb1 xxdb2 .
From node 1's CRS alert log we can see that node evictions occurred on the 23rd, the 25th and the 28th. Judging from these messages alone, the problem appears to be network related.
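The CRS-1612 / CRS-1611 / CRS-1610 / CRS-1607 sequence above is simply the CSS misscount countdown: warnings when the network heartbeat has been missing for 50%, 75% and 90% of the timeout interval, followed by the eviction itself. Below is a minimal sketch of how the "Removal of this node from cluster in N seconds" figures relate to the time since the last network heartbeat, assuming the default misscount of 30 seconds (the actual value can be confirmed with crsctl get css misscount); this is an illustration only, not Oracle's code.

# Sketch of the CSS misscount countdown behind the CRS-1612/1611/1610 warnings.
# Assumption: misscount = 30 s (the usual 11.2 default for non-vendor clusterware);
# verify the real value with "crsctl get css misscount".
MISSCOUNT = 30.0  # seconds without a network heartbeat before eviction

def removal_countdown(seconds_since_last_nhb):
    """Return (seconds left before eviction, warning level) for a peer node."""
    remaining = round(MISSCOUNT - seconds_since_last_nhb, 3)
    missed = seconds_since_last_nhb / MISSCOUNT
    if missed >= 0.90:
        level = "CRS-1610 (90%)"
    elif missed >= 0.75:
        level = "CRS-1611 (75%)"
    elif missed >= 0.50:
        level = "CRS-1612 (50%)"
    else:
        level = "ok"
    return remaining, level

# Example: the 50% warning on the 28th says "removal in 14.023 seconds",
# i.e. roughly 16 s had already passed since the last network heartbeat.
print(removal_countdown(15.977))   # -> (14.023, 'CRS-1612 (50%)')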
Node 1's ocssd.log is as follows:
2014-10-28 11:03:47.010: [    CSSD][1029]clssscSelect: cookie accept request 110987400
2014-10-28 11:03:47.010: [    CSSD][1029]clssgmAllocProc: (113c7b590) allocated
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: properties of cmProc 113c7b590 - 0,1,2,3,4
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: Connect from con(1d287ee) proc(113c7b590) pid(26345486) version 11:2:1:4, properties: 0,1,2,3,4
2014-10-28 11:03:47.016: [    CSSD][1029]clssgmClientConnectMsg: msg flags 0x0000
2014-10-28 11:03:47.061: [    CSSD][2577]clssnmSetupReadLease: status 1
........
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 50% heartbeat fatal, removal in 14.023 seconds
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) is impending reconfig, flag 2294796, misstime 15977
2014-10-28 11:03:48.965: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-10-28 11:03:48.965: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917536, LATS 2041808730, lastSeqNo 41914523, uniqueness 1414188209, timestamp 1414465442/2128363897
......
2014-10-28 11:03:49.611: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917537, LATS 2041809376, lastSeqNo 41917536, uniqueness 1414188209, timestamp 1414465443/2128364399
2014-10-28 11:03:49.612: [    CSSD][2577]clssnmSetupReadLease: status 1
2014-10-28 11:03:49.617: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
2014-10-28 11:03:49.617: [    CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
2014-10-28 11:03:49.619: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (113c9aad0), operation status 1, client status 0
2014-10-28 11:03:49.633: [    CSSD][2577]clssnmCompleteGMReq: Completed request type 18 with status 1
2014-10-28 11:03:49.633: [    CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
2014-10-28 11:03:49.635: [    CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 18 for proc (113c9aad0), operation status 1, client status 0
2014-10-28 11:03:49.671: [    CSSD][1029]clssnmGetNodeNumber: xxdb1
2014-10-28 11:03:49.725: [    CSSD][1029]clssnmGetNodeNumber: xxdb2
.......
2014-10-28 11:03:49.969: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917538, LATS 2041809734, lastSeqNo 41917537, uniqueness 1414188209, timestamp 1414465443/2128364901
.......
2014-10-28 11:03:50.970: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917540, LATS 2041810735, lastSeqNo 41917538, uniqueness 1414188209, timestamp 1414465444/2128365902
.......
2014-10-28 11:03:51.248: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2014-10-28 11:03:51.248: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:03:51.975: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917542, LATS 2041811740, lastSeqNo 41917540, uniqueness 1414188209, timestamp 1414465445/2128366906
.......
2014-10-28 11:04:00.007: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917558, LATS 2041819771, lastSeqNo 41917556, uniqueness 1414188209, timestamp 1414465453/2128374949
2014-10-28 11:04:00.008: [    CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 90% heartbeat fatal, removal in 2.979 seconds, seedhbimpd 1
2014-10-28 11:04:01.010: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917560, LATS 2041820775, lastSeqNo 41917558, uniqueness 1414188209, timestamp 1414465454/2128375951
2014-10-28 11:04:02.012: [    CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 41917562, LATS 2041821776, lastSeqNo 41917560, uniqueness 1414188209, timestamp 1414465455/2128376952
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmPollingThread: Removal started for node xxdb2 (2), flags 0x23040c, state 3, wt4c 0
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmMarkNodeForRemoval: node 2, xxdb2 marked for removal
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmDiscHelper: xxdb2, node(2) connection failed, endp (c1da79), probe(0), ninf->endp c1da79
2014-10-28 11:04:02.988: [    CSSD][3605]clssnmDiscHelper: node 2 clean up, endp (c1da79), init state 5, cur state 5
2014-10-28 11:04:02.988: [GIPCXCPT][3605] gipcInternalDissociate: obj 11299f410 [0000000000c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2014-10-28 11:04:02.988: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 11299f410 [0000000000c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 265564871
2014-10-28 11:04:02.988: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 1, from -1, changes 61
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 265564871
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564871), indicating EXADATA fence initialization complete
........
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmNeedConfReq: No configuration to change
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, xxdb2, misstime(30001) state(5)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2014-10-28 11:04:02.988: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 1, events 0xa0, state 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckSplit: Node 2, xxdb2, is alive, DHB (1414465455, 2128376952) more than disk timeout of 27000 after the last NHB (1414465426, 2128347958)
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 1
2014-10-28 11:04:02.988: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 0, events 0x20, state 0x0
2014-10-28 11:04:02.988: [    CSSD][4119]clssnmRemove: Start
2014-10-28 11:04:02.988: [    CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, xxdb2, from the cluster in incarnation 265564871, node birth incarnation 265564870, death incarnation 265564871, stateflags 0x234000 uniqueness value 1414188209
........
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmSendShutdown: req to node 2, kill time 2041822753
2014-10-28 11:04:02.989: [    CSSD][4119]clssnmsendmsg: not connected to node 2
We can see that on the 28th it was the clssnmPollingThread that raised the alarm. Oracle essentially relies on this CSSD thread to decide whether the heartbeats between cluster nodes are still healthy.
From the messages above, the indication is that node 2 had lost its network heartbeat even though its disk heartbeat was still being written.
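The repeated "has a disk HB, but no network HB" messages are the crux: node 2 is still writing its disk heartbeat to the voting disk, so the node itself is alive, but the network heartbeat over the private interconnect has stopped arriving. The following is a rough, purely illustrative model of that classification (my own simplification, not Oracle's implementation); the 30 s / 27 s thresholds mirror the misscount and the shortened disk timeout visible in the log.

import time

# Illustrative model of how a CSSD-style monitor could classify a peer node
# from its two heartbeats. Thresholds mirror the values seen in the log:
# 30 s misscount for the network heartbeat, 27 s short disk timeout during
# an impending reconfiguration.
NETWORK_HB_TIMEOUT = 30.0
DISK_HB_TIMEOUT = 27.0

def classify_peer(now, last_network_hb, last_disk_hb):
    net_ok = (now - last_network_hb) < NETWORK_HB_TIMEOUT
    disk_ok = (now - last_disk_hb) < DISK_HB_TIMEOUT
    if net_ok and disk_ok:
        return "healthy"
    if disk_ok:
        # The situation in this case: the peer is alive and writing to the
        # voting disk, but the interconnect is not delivering heartbeats,
        # so split-brain resolution has to decide who survives.
        return "disk HB but no network HB (interconnect problem)"
    if net_ok:
        return "network HB but no disk HB (voting disk / storage problem)"
    return "no heartbeats (node most likely down)"

now = time.time()
print(classify_peer(now, last_network_hb=now - 16.0, last_disk_hb=now - 1.0))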
The focus now turns to node 2's logs. First, node 2's alert log:
Tue Oct 28 10:59:40 2014
Thread 2 advanced to log sequence 26516 (LGWR switch)
  Current log# 208 seq# 26516 mem# 0: +ORADATA/portaldb/redo208.log
Tue Oct 28 11:04:18 2014
NOTE: ASMB terminating
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_23724130.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_23724130.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
ASMB (ospid: 23724130): terminating the instance due to error 15064
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (22806846) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (16187426) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (28049650) as a result of ORA-1092
Tue Oct 28 11:04:18 2014
ORA-1092 : opitsk aborting process
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (7864418) as a result of ORA-1092
Instance terminated by ASMB, pid = 23724130
Tue Oct 28 11:05:45 2014
Starting ORACLE instance (normal)
........
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /oracle/product/11.2.0
System name: AIX
Node name: xxdb2
Release: 1
Version: 7
The database instance's alert log offers little of value, so let's move on to the most critical log, node 2's ocssd.log:
2014-10-28 11:03:58.792: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 50% heartbeat fatal, removal in 14.773 seconds
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) is impending reconfig, flag 2491406, misstime 15227
2014-10-28 11:04:01.217: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2014-10-28 11:04:01.217: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932275, LATS 2128362398, lastSeqNo 42873784, uniqueness 1414000248, timestamp 1414465427/2041806941
........
2014-10-28 11:04:13.242: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932297, LATS 2128374423, lastSeqNo 42932295, uniqueness 1414000248, timestamp 1414465439/2041818993
2014-10-28 11:04:13.243: [    CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 90% heartbeat fatal, removal in 2.746 seconds, seedhbimpd 1
2014-10-28 11:04:14.244: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932299, LATS 2128375425, lastSeqNo 42932297, uniqueness 1414000248, timestamp 1414465440/2041819996
2014-10-28 11:04:14.843: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2014-10-28 11:04:14.843: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2014-10-28 11:04:15.246: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932301, LATS 2128376426, lastSeqNo 42932299, uniqueness 1414000248, timestamp 1414465441/2041821006
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmPollingThread: Removal started for node xxdb1 (1), flags 0x26040e, state 3, wt4c 0
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmMarkNodeForRemoval: node 1, xxdb1 marked for removal
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmDiscHelper: xxdb1, node(1) connection failed, endp (90e), probe(0), ninf->endp 90e
2014-10-28 11:04:15.994: [    CSSD][3605]clssnmDiscHelper: node 1 clean up, endp (90e), init state 5, cur state 5
2014-10-28 11:04:15.994: [GIPCXCPT][3605] gipcInternalDissociate: obj 111261650 [000000000000090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
2014-10-28 11:04:15.994: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 111261650 [000000000000090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 }, flags 0x0
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Initiating sync 265564871
2014-10-28 11:04:15.994: [    CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress val 2, from -1, changes 4
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation 265564871
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSetupAckWait: Ack message type (11)
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE
2014-10-28 11:04:15.994: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932302, LATS 2128377175, lastSeqNo 42932301, uniqueness 1414000248, timestamp 1414465441/2041821506
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564871), indicating EXADATA fence initialization complete
.......
2014-10-28 11:04:15.994: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, xxdb1, misstime(30004) state(5)
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (1414465441, 2041821506) more than disk timeout of 27000 after the last NHB (1414465412, 2041791963)
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(ocr_xxdb-scan) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0x78, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119] ----- Call Stack Trace -----
........
2014-10-28 11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: CmInfo State val 2, changes 13
2014-10-28 11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes val 265564870, changes 5
2014-10-28 11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(265564870)
2014-10-28 11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(265564870)
2014-10-28 11:04:15.996: [    CSSD][1]clssgmStartNMMon: completed node cleanup
2014-10-28 11:04:15.996: [    CSSD][4119]calling call entry argument values in hex
2014-10-28 11:04:15.996: [    CSSD][4119]location type point (? means dubious value)
2014-10-28 11:04:15.996: [    CSSD][3348]clssgmUpdateEventValue: HoldRequest val 1, changes 3
2014-10-28 11:04:15.996: [    CSSD][4119]-------------------- -------- -------------------- ----------------------------
2014-10-28 11:04:15.997: [    CSSD][4119]clssscExit()+708 bl 10006db74 67800000678 ? 1000F33C8 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 11003ADF0 ? 90000000085C914 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmCheckDskInfo( call clssscExit() 111B08E30 ? 200000002 ?
2014-10-28 11:04:15.997: [    CSSD][4119])+1600 11003ADF0 ? 90000000085C914 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmDoSyncUpdate( call clssnmCheckDskInfo( 111B08E30 ? 110E3F130 ?
2014-10-28 11:04:15.997: [    CSSD][4119])+4016 ) 000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 013A80187 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000004 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssnmRcfgMgrThread call clssnmDoSyncUpdate( 111B08E30 ? 110E3F130 ?
2014-10-28 11:04:15.997: [    CSSD][4119]()+2992 ) 000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 1128045D0 ? 100000001 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 013A80187 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]clssscthrdmain()+20 call clssnmRcfgMgrThread 111B08E30 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119]8 () 000000000 ? 000000000 ?
2014-10-28 11:04:15.997: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]_pthread_body()+240 call clssscthrdmain() 111B08E30 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]fffffffffffffffc call _pthread_body() 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119] 000000000 ? 000000000 ?
2014-10-28 11:04:15.998: [    CSSD][4119]
.......
----- End of Call Stack Trace -----
2014-10-28 11:04:16.052: [    CSSD][4119]clssnmSendMeltdownStatus: node xxdb2, number 2, has experienced a failure in thread number 3 and is shutting down
2014-10-28 11:04:16.052: [    CSSD][4119]clssscExit: Starting CRSD cleanup
2014-10-28 11:04:16.052: [    CSSD][2320]clssnmvDiskKillCheck: not evicted, file /dev/rhdisk22 flags 0x00000000, kill block unique 0, my unique 1414188209
.......
2014-10-28 11:04:16.995: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932304, LATS 2128378175, lastSeqNo 42932302, uniqueness 1414000248, timestamp 1414465442/2041822515
2014-10-28 11:04:17.004: [    CSSD][3605]clssnmPollingThread: state(3) clusterState(2) exit
2014-10-28 11:04:17.004: [    CSSD][3605]clssscExit: abort already set 1
2014-10-28 11:04:17.053: [    CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync 265564871, stamp 2041822753
2014-10-28 11:04:17.053: [    CSSD][2320]clssscExit: abort already set 1
........
2014-10-28 11:04:17.995: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564871, wrtcnt, 42932306, LATS 2128379175, lastSeqNo 42932304, uniqueness 1414000248, timestamp 1414465443/2041823528
........
2014-10-28 11:04:18.600: [    CSSD][4119]clssscExit: CRSD cleanup successfully completed
2014-10-28 11:04:18.602: [ default][4119]kgzf_gen_node_reid2: generated reid cid=2cca484453d04f96bf8ea33b240c4c97,icin=265564851,nmn=2,lnid=265564870,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
2014-10-28 11:04:18.602: [    CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
2014-10-28 11:04:18.602: [    CSSD][4119]clssgmUpdateEventValue: CmInfo State val 0, changes 14
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking RPC Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking dead client Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmProcClientReqs: Checking dead proc Q
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: Aborting client (1114fb0f0) proc (111dceb10), iocapables 1.
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: I/O capable proc (111dceb10), pid (19595302), iocapables 1, client (1114fb0f0)
2014-10-28 11:04:18.602: [    CSSD][1029]clssgmSendShutdown: Aborting client (11152a530) proc (111d29b50), iocapables 2.
Filtering node 2's ocssd.log down to the key entries, we find the following.
On the 25th:
2014-10-25 06:03:07.804: [    CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg 265564869, wrtcnt, 42656043, LATS 1851108984, lastSeqNo 42656042, uniqueness 1414000248, timestamp 1414188173/1764553533
.......
2014-10-25 06:03:07.804: [    CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync 265564869, stamp 1764553706
2014-10-25 06:03:07.804: [    CSSD][2320]###################################
2014-10-25 06:03:07.804: [    CSSD][4376]clssnmHandleSync: Node xxdb2, number 2, is EXADATA fence capable
2014-10-25 06:03:07.804: [    CSSD][2320]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2014-10-25 06:03:07.804: [    CSSD][4119]clssnmSendSync: syncSeqNo(265564869)
2014-10-25 06:03:07.804: [    CSSD][2320]###################################
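So on the 25th, node 2 was not the one that decided to die: it was killed through the voting disk. Node 1 wrote a kill block for node 2, and node 2's clssnmvKillBlockThread noticed it during its periodic clssnmvDiskKillCheck and aborted CSSD (the CSSNM00005 message). A toy sketch of that kill-block idea follows, with a local JSON file standing in for the shared voting disk; the file name and record layout are purely made up for illustration.

import json, os, sys, time

# Toy illustration of voting-disk based eviction ("kill block"): the surviving
# node writes a kill record for the victim, and the victim's periodic
# kill-block check reads it and aborts. A local file stands in for the shared
# voting disk; the real on-disk format is nothing like this.
VOTEDISK = "/tmp/fake_votedisk_killblock.json"   # hypothetical stand-in

def write_kill_block(evictor_node, victim_node, sync_no):
    """Run on the surviving node: mark the victim for eviction."""
    with open(VOTEDISK, "w") as f:
        json.dump({"victim": victim_node, "evictor": evictor_node,
                   "sync": sync_no, "stamp": time.time()}, f)

def disk_kill_check(my_node):
    """Run periodically on every node: abort if another node evicted us."""
    if not os.path.exists(VOTEDISK):
        return
    with open(VOTEDISK) as f:
        kb = json.load(f)
    if kb.get("victim") == my_node:
        print("Aborting, evicted by node %s, sync %s" % (kb["evictor"], kb["sync"]))
        sys.exit(1)

write_kill_block(evictor_node=1, victim_node=2, sync_no=265564869)
disk_kill_check(my_node=2)   # node 2 sees the kill block and aborts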
On the 28th:
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (1414465441, 2041821506) more than disk timeout of 27000 after the last NHB (1414465412, 2041791963)
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2014-10-28 11:04:15.995: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2014-10-28 11:04:15.995: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
2014-10-28 11:04:15.995: [    CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
2014-10-28 11:04:15.995: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2014-10-28 11:04:15.995: [    CSSD][4119]###################################
Looking purely at the data from the 25th and the 28th, these two failures are actually different. From the key entries above we can see that on the 25th CSSD aborted from the clssnmvKillBlockThread thread, while on the 28th it aborted from clssnmRcfgMgrThread.
Clearly these two threads do completely different jobs: the first handles voting disk (kill block) operations, while the second is network related, driving cluster reconfiguration after the network heartbeat is lost.
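On the 28th, the decision came from clssnmCheckDskInfo's split-brain resolution: with the interconnect down, each node formed its own one-node cohort, exchanged that information through the voting disk, and node 2 aborted itself because its cohort lost (CSSNM00008). My understanding of the usual CSS rule is that the larger cohort survives and, when the cohorts are the same size, the cohort containing the lowest node number wins, which is why node 1 survived here. A simplified sketch of that survivor rule, based on my reading of the behavior rather than Oracle's code:

# Simplified split-brain survivor selection, as suggested by the CSSNM00008
# message: the larger cohort survives; on a tie, the cohort containing the
# lowest node number wins (my reading of the typical pre-12.2 CSS behavior).
def surviving_cohort(cohort_a, cohort_b):
    """Each cohort is a set of node numbers that can still talk to each other."""
    if len(cohort_a) != len(cohort_b):
        return cohort_a if len(cohort_a) > len(cohort_b) else cohort_b
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

# The 28th: node 1 and node 2 each formed a one-node cohort, so the tie-break
# favors the cohort containing node 1, and node 2 aborts to avoid split brain.
print(surviving_cohort({1}, {2}))   # -> {1}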
Finally, I asked the reader what had changed recently; reportedly the problem started right after a network switch was replaced.
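Since every eviction here follows a loss of the network heartbeat over the private interconnect, one simple way to corroborate the switch suspicion is to monitor the interconnect continuously from each node and look for gaps approaching the misscount. A rough sketch is below; the peer address is a placeholder rather than a value from this case, and the ping flags are Linux style and may need adjusting on AIX.

import subprocess
import time
from datetime import datetime

# Rough interconnect monitor: ping the peer's private IP once a second and
# log any outage long enough to threaten the 30 s misscount.
PEER_PRIVATE_IP = "192.168.100.2"   # hypothetical private interconnect address

def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo succeeds (Linux-style ping flags)."""
    rc = subprocess.call(["ping", "-c", "1", "-w", str(timeout_s), host],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return rc == 0

last_ok = time.time()
while True:
    if ping_once(PEER_PRIVATE_IP):
        last_ok = time.time()
    else:
        gap = time.time() - last_ok
        print("%s interconnect ping lost for %.1f s" % (datetime.now(), gap))
    time.sleep(1)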
Sharing this small case with everyone!