9i RAC: split-brain causes a node restart
Unless marked as a reprint, all articles on this site are original. Reprinted from love wife love life — Roger's Oracle/MySQL/PostgreSQL data recovery blog.
A colleague sent over some trace files today from a case where one node of a 9i RAC restarted. A rough analysis follows. First, let's look at node 1's alert log:
```
Tue Oct 18 11:03:00 2011
Timed out trying to start process P001.
Tue Oct 18 11:06:34 2011
Evicting instance 2 from cluster
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 19)
List of nodes: 0
Tue Oct 18 11:06:38 2011
Reconfiguration started (old inc 18, new inc 20)
```
We can see that at 11:06:34 the cluster began evicting instance 2. Now let's look at node 2's side of the story:
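When triaging a case like this, it helps to first pull the eviction- and reconfiguration-related lines out of each node's alert log. A minimal sketch (the sample lines are taken from node 1's log above; the keyword list is my own choice, not an exhaustive set):

```python
import re

# Alert-log lines of interest when triaging an eviction; extend as needed.
PATTERNS = re.compile(
    r"Evicting instance|Reconfiguration (started|complete)|"
    r"split-brain|IPC Send timeout|ORA-(00481|29740)"
)

def scan_alert(lines):
    """Return (line_no, text) for every alert-log line matching PATTERNS."""
    return [(i, l.strip()) for i, l in enumerate(lines, 1) if PATTERNS.search(l)]

sample = [
    "Tue Oct 18 11:03:00 2011",
    "Timed out trying to start process P001.",
    "Tue Oct 18 11:06:34 2011",
    "Evicting instance 2 from cluster",
    "Tue Oct 18 11:06:38 2011",
    "Reconfiguration started (old inc 18, new inc 19)",
]
for no, text in scan_alert(sample):
    print(no, text)   # lines 4 and 6 match
```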
```
Tue Oct 18 10:51:57 2011
IPC Send timeout detected. Sender ospid 22239
Tue Oct 18 10:51:59 2011
Communications reconfiguration: instance 0
Tue Oct 18 10:52:03 2011
ARC0: Completed archiving log 6 thread 2 sequence 87953
Tue Oct 18 10:53:45 2011
Waiting for clusterware split-brain resolution
Tue Oct 18 11:02:14 2011
Timed out trying to start process P001.
Tue Oct 18 11:03:45 2011
USER: terminating instance due to error 481
Tue Oct 18 11:03:45 2011
Errors in file /orarac/app/oracle/admin/siebdb/bdump/siebdb2_lgwr_15116.trc:
ORA-00481: LMON process terminated with error
Tue Oct 18 11:03:45 2011
Tue Oct 18 11:07:56 2011
Reconfiguration started (old inc 0, new inc 21)
List of nodes: 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS remote-cache requests
Tue Oct 18 11:08:19 2011
 0 write requests issued in 1511603 GCS resources
 436 PIs marked suspect, 0 flush PI msgs
Tue Oct 18 11:08:22 2011
Reconfiguration complete
```
From node 2's alert log we can derive the following timeline:

10:51:57 --> IPC send timeout
10:51:59 --> communications failure; the cluster attempts to shut down instance 2 ("instance 0", where the 0 indicates a shutdown)
10:53:45 --> split-brain detected
11:02:14 --> timed out trying to start process P001 (from this we can infer that ospid 22239 seen earlier was probably the P001 process)
11:03:45 --> instance terminated, followed by the ORA-00481 error

From this analysis we can conclude that the split-brain caused instance 2 to crash. So what caused the split-brain in the first place? Finally, let's look at the trace of the LMON process mentioned in the alert log:
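The gaps between these events can be computed directly from the timestamps. A small sketch, using only the timestamps quoted from node 2's alert log above (note that the wait for split-brain resolution, 10:53:45 to 11:03:45, lasted exactly 600 seconds):

```python
from datetime import datetime

# Event timeline reconstructed from node 2's alert log.
events = [
    ("Tue Oct 18 10:51:57 2011", "IPC send timeout (ospid 22239)"),
    ("Tue Oct 18 10:51:59 2011", "communications reconfiguration"),
    ("Tue Oct 18 10:53:45 2011", "waiting for split-brain resolution"),
    ("Tue Oct 18 11:03:45 2011", "instance terminated (ORA-00481)"),
    ("Tue Oct 18 11:08:22 2011", "reconfiguration complete"),
]

fmt = "%a %b %d %H:%M:%S %Y"
start = datetime.strptime(events[0][0], fmt)
for ts, label in events:
    t = datetime.strptime(ts, fmt)
    # Offset of each event from the first symptom, in seconds.
    print(f"+{(t - start).total_seconds():6.0f}s  {label}")
```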
```
*** SESSION ID:(3.1) 2011-10-18 07:08:57.162
GES IPC: Receivers 10 Senders 10
GES IPC: Buffers Receive 1000 Send (i:6830 b:6830) Reserve 1000
GES IPC: Msg Size Regular 440 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 47, ack 52
Batching factor: cache replay 29 size per lock 64
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 0 Hewlett-Packard SKGXN 2.0)
*** 2011-10-18 07:09:01.648
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 2011-10-18 07:09:01.649
Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 3122016
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 18 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 18 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 18 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 18 5.
Name Service normal
Name Service recovery done
*** 2011-10-18 07:09:02.652
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 18 6.
*** 2011-10-18 07:09:02.652
*** 2011-10-18 07:09:02.653
Reconfiguration started (old inc 0, new inc 18)
Synchronization timeout interval: 660 sec    ====> synchronization timeout: 660 s
List of nodes: 0 1                           ====> cluster node IDs
 Global Resource Directory frozen
 node 0 release 9 2 0 7                      ====> Oracle version
 node 1 release 9 2 0 7
 res_master_weight for node 0 is 3122016
 res_master_weight for node 1 is 3122016
 Total master weight = 6244032
 Dead inst
 Join inst 0 1
 Exist inst
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info
 Submitted all remote-enqueue requests
 kjfcrfg: Number of mesgs sent to node 0 = 0
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
*** 2011-10-18 07:09:11.515
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS cache requests
 0 write requests issued in 1518892 GCS resources
 10 PIs marked suspect, 0 flush PI msgs
*** 2011-10-18 07:09:27.724
Reconfiguration complete
*** 2011-10-18 07:09:35.232
kjxgrtmc2: Member 1 thread 2 mounted
*** 2011-10-18 10:51:59.470                  ====> this timestamp is important
kjxgrcomerr: Communications reconfig: instance 0 (18,18)
Submitting asynchronized dump request [2]
kjxgrrcfgchk: Initiating reconfig, reason 3  ====> communications failure
*** 2011-10-18 10:52:04.499
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 18 0.
*** 2011-10-18 10:52:04.504
Name Service frozen
kjxgmcs: Setting state to 18 1.
*** 2011-10-18 10:52:04.514
Obtained RR update lock for sequence 18, RR seq 18
*** 2011-10-18 10:53:45.311                  ====> split-brain detected
kjxgrrecp2: Waiting for split-brain resolution, upd 1, seq 19
*** 2011-10-18 11:03:45.315
Voting results, upd 1, seq 19, bitmap: 0
*** 2011-10-18 11:03:45.315
kjxgrdtrt: Evicted by 1, seq (19, 19)
IMR state information                        ====> IMR triggered
 Member 1, thread 2, state 4, flags 0x00a1
 RR seq 19, propstate 3, pending propstate 0
 rcfg rsn 3, rcfg time 21546700, mem ct 2
 master 1, master rcfg time 21546700
 Member information:
  Member 0, incarn 18, version 1840361946
   thrd 1, prev thrd 65535, status 0x0007, err 0x0002
   valct 0
  Member 1, incarn 18, version 816352
   thrd 2, prev thrd 65535, status 0x0107, err 0x0000
   valct 2
Group name: SIEBDB
Member id: 1
Cached SKGXN event: 0
Group State:
 State: 18 1
 Commited Map: 0 1
 New Map: 0 1
 SKGXN Map: 0 1
 Master node: 0
 Memcnt 2  Rcvcnt 0
 Substate Proposal: false
 Inc Proposal: incarn 0 memcnt 0 master 0 proposal false matched false
 map:
 Master Inc State: incarn 0 memcnt 0 agrees 0 flag 0x1
 wmap:
 nmap:
 ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 1, group incarnation 19
```
Looking at the timestamps as a whole, the interval from the split-brain being triggered to its resolution completing was roughly 15 minutes, approximately 10:51 to 11:07.

The reconfiguration reason codes are explained as follows:

Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications failure
Reason 4 = Reconfiguration after suspend

If the 660 s seen earlier really were the heartbeat timeout threshold, then this failure should not have occurred. According to my colleague, the clocks of the RAC cluster nodes differed by 3 minutes; early that morning, the hardware vendor had been replacing batteries. On that basis we can fairly safely conclude that the split-brain was caused by the clock skew between the nodes. We know that when a third-party cluster product such as HA is in place, I/O fencing is handled by that software, so I still don't know what the exact threshold is here. In other words, after how many seconds of divergence is a split-brain declared? Could it really be the 660 s above?

Finally, a supplementary note: LMON mainly checks two kinds of heartbeat:

1. The network heartbeat, which is simply the cluster heartbeat (presumably implemented via a ping mechanism).
2. The controlfile heartbeat: the CKPT process updates the controlfile every 3 seconds, and this is used for the check.
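To make the clock-skew hypothesis concrete, here is a toy model (this is not Oracle's internal algorithm; the timeout and skew values are illustrative, with the 3-minute skew taken from the case) of how a node that trusts its own clock can judge a healthy peer's heartbeats to be stale:

```python
# Toy model: a node declares its peer dead when the peer's last heartbeat
# timestamp, read against the node's OWN clock, is older than the timeout.
# Values are illustrative; only the 3-minute skew comes from this case.
HEARTBEAT_TIMEOUT = 60   # hypothetical timeout, in seconds
CLOCK_SKEW = 180         # node 2's clock runs 3 minutes ahead of node 1's

def peer_looks_dead(last_heartbeat_ts, now_local, timeout=HEARTBEAT_TIMEOUT):
    """True if the peer's last heartbeat looks older than the timeout."""
    return (now_local - last_heartbeat_ts) > timeout

# Node 1 stamps a heartbeat t=1000 on its own clock; node 2 evaluates it
# 5 seconds later on a clock that may or may not be skewed.
sent = 1000
print(peer_looks_dead(sent, sent + 5))               # clocks in sync -> False
print(peer_looks_dead(sent, sent + 5 + CLOCK_SKEW))  # 3-min skew -> True
```

In this model the peer never stopped sending heartbeats; the skewed observer alone concludes it is gone, which is exactly the kind of false divergence that can escalate into a split-brain vote and eviction.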