Case Analysis: A RAC Instance Mysteriously Restarted
On a customer's RAC, node 1 was restarted for no obvious reason, as shown below:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
License high water mark = 5209
Wed Aug 14 08:26:03 2013
Termination issued to instance processes. Waiting for the processes to exit
Wed Aug 14 08:26:09 2013
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 15573208
Wed Aug 14 08:26:10 2013
Instance terminated by USER, pid = 15335602
Wed Aug 14 08:27:17 2013
Starting ORACLE instance (normal)
At the same time, the other node of this RAC went through a reconfiguration. Its alert log shows:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes: 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
 LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013
This part of the log does not tell us much by itself, so let's look more carefully at node 1's alert log:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected. Receiver ospid 14856422     ++++ from the ospid we can see that 14856422 is exactly lms3
Receiver is waiting for a latch
 dumping latch state for receiver -16444
Wed Aug 14 08:23:50 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms3_14856422.trc:
Wed Aug 14 08:23:54 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:51 2013
Errors in file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lmon_15573208.trc:
ORA-29740: evicted by member 1, group incarnation 6
Wed Aug 14 08:25:54 2013
LMON: terminating instance due to error 29740          +++++ the lmon process terminates the instance after raising ORA-29740
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_pmon_6075054.trc:
ORA-29740: evicted by member , group incarnation
……..
……
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms9_14266524.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
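(As a side note: while the instance is still up, the mapping from an OS pid such as 14856422 to its background process can be confirmed straight from the data dictionary rather than by eyeballing trace file names. A minimal sketch using the standard v$ views — the ospid literal is simply the value from the log above:)

-- Which background process owns a given OS pid? Run on the affected instance.
SELECT p.spid    AS os_pid,
       p.pid     AS oracle_pid,
       b.name    AS bg_process,     -- e.g. LMS3, LMON, LGWR; NULL for a foreground process
       p.program
FROM   v$process p
       LEFT JOIN v$bgprocess b ON b.paddr = p.addr
WHERE  p.spid = '14856422';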
At the same time, node 2's alert log also reports the timeout of node 1's lms3 process:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822        +++++ the lms3 process on the crm2 node
Receiver: inst 1 binc 57382989 ospid 14856422         +++++ the lms3 process on the crm1 node
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes: 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
 LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013
From node 2's trace file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc we can confirm that process 454822 is the lms3 process on the crm2 node, as shown below:
Dump file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:    AIX
Node name:      ngcrmdb2
Release:        3
Version:        5
Machine:        00C8A6E24C00
Instance name: ngcrmdb2
Redo thread mounted by this instance: 0 <none>
Oracle process number: 10
Unix process pid: 454822, image: oracle@ngcrmdb2 (LMS3)
......
......
*** 2013-08-12 10:34:58.266
kjbmskipref(0x312f.3b10000)(seqs 0x13361338)(rids 0x13f813f9)(from 0x192)
*** 2013-08-14 08:23:20.471
*** 2013-08-14 08:23:20.474
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
SKGXPCTX: 0x1104103b0 ctx
WAIT HISTORY
            Time(msec)   Wait Type     Return Code
            ----------   ---------     ------------
                    14      NORMAL          POSTED
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
 wait delta 0 sec (3 msec) ctx ts 0x57cd2bd2 last ts 0x57cd2bd2
 user cpu time since last wait 0 sec 0 ticks
 system cpu time since last wait 0 sec 0 ticks
 locked 1 blocked 8 timed wait receives 87 admno 0xbc1dc98
 admport:
SSKGXPT 0x104111c0 flags socket no 11 IP 192.168.5.2 UDP 33191
 context timestamp 0x57cd2bd2
  buffers queued on port 1104cff40
   sconno     accono     ertt  state  seq#   sent       async      sync  rtrans  acks
   0x7dd68acd 0x12d3831a 1024  3      42918  292492860  292492982  0     2695    286577042
 slot 0 rqh=1110ca6c8 seq=42912 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
 slot 1 rqh=1110ca2e0 seq=42913 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
 slot 2 rqh=1110c9ef8
……
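(Incidentally, the trace above also shows which interconnect address the LMS traffic was using — IP 192.168.5.2 over UDP. When chasing IPC send timeouts it is worth cross-checking that this matches what each instance actually registered; a small illustrative query against the standard gv$ view, nothing case-specific beyond the idea:)

-- Interconnect(s) registered by each RAC instance, and where the setting came from
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;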
From this we can see that a timeout in the lms3-to-lms3 communication between the two instances is what caused the crm1 node to be evicted and restarted. Based on the crm1 node's alert log, the sequence of events is as follows:
Aug 14 08:23:20 – lms3 timeout (detected by the lmon check)
Aug 14 08:25:54 – ORA-29740: evicted by member 1, group incarnation 6 is raised; the lmon process terminates the instance
Aug 14 08:25:54 – block recovery completes (Doing block recovery for file 204 block 35113), performed by SMON
Aug 14 08:26:00 – the crm1 instance is shut down with abort
From node 1's alert log we can see that the lms3 process timed out at Aug 14 08:23:20. Generally speaking, an lms timeout can be caused by excessive system pressure, i.e. very high load. Below we analyze the traces of the lgwr, lmd and lmon processes to test this. First, the lgwr trace:
*** 2013-08-13 02:29:49.603
Warning: log write time 1500ms, size 278KB
*** 2013-08-13 03:24:00.496
Warning: log write time 1440ms, size 13397KB
*** 2013-08-13 03:24:04.213
Warning: log write time 540ms, size 18775KB
*** 2013-08-13 03:24:05.298
Warning: log write time 1080ms, size 10027KB
*** 2013-08-13 03:24:09.590
Warning: log write time 540ms, size 10124KB
*** 2013-08-13 03:24:10.114
Warning: log write time 520ms, size 18577KB
*** 2013-08-13 03:24:14.393
Warning: log write time 540ms, size 16529KB
*** 2013-08-13 03:24:15.400
Warning: log write time 1010ms, size 13973KB
*** 2013-08-13 03:24:16.813
Warning: log write time 500ms, size 20055KB
*** 2013-08-13 03:24:19.572
Warning: log write time 1360ms, size 11320KB
*** 2013-08-13 03:24:22.962
Warning: log write time 760ms, size 18383KB
*** 2013-08-13 03:24:25.383
Warning: log write time 510ms, size 13886KB
*** 2013-08-13 03:24:26.791
Warning: log write time 1410ms, size 14385KB
*** 2013-08-13 03:24:41.486
Warning: log write time 890ms, size 11268KB
*** 2013-08-13 03:24:42.371
Warning: log write time 880ms, size 27178KB
*** 2013-08-13 03:24:55.580
Warning: log write time 570ms, size 14009KB
*** 2013-08-13 03:24:58.326
Warning: log write time 2730ms, size 31518KB
*** 2013-08-13 09:53:36.967
Warning: log write time 1740ms, size 5KB
*** 2013-08-13 12:10:41.138
Warning: log write time 580ms, size 2KB
*** 2013-08-13 14:39:31.894
Warning: log write time 1080ms, size 102KB
*** 2013-08-14 00:51:42.142
Warning: log write time 800ms, size 347KB
*** 2013-08-14 01:20:31.054
Warning: log write time 510ms, size 1455KB
*** 2013-08-14 01:24:35.160
Warning: log write time 500ms, size 463KB
*** 2013-08-14 01:24:36.246
Warning: log write time 1080ms, size 887KB
*** 2013-08-14 01:24:37.135
Warning: log write time 890ms, size 2201KB
*** 2013-08-14 02:04:42.532
Warning: log write time 590ms, size 14KB
*** 2013-08-14 02:06:04.027
Warning: log write time 570ms, size 43KB
*** 2013-08-14 03:25:27.431
Warning: log write time 500ms, size 13805KB
*** 2013-08-14 03:25:42.751
Warning: log write time 650ms, size 1919KB
*** 2013-08-14 03:25:52.230
Warning: log write time 1130ms, size 35575KB
*** 2013-08-14 03:25:52.737
Warning: log write time 500ms, size 9288KB
*** 2013-08-14 03:25:54.660
Warning: log write time 1020ms, size 9277KB
*** 2013-08-14 03:25:55.581
Warning: log write time 910ms, size 36265KB
*** 2013-08-14 03:25:56.369
Warning: log write time 500ms, size 36267KB
*** 2013-08-14 03:25:58.585
Warning: log write time 1010ms, size 14202KB
tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x1)
tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x1)
*** 2013-08-14 03:59:30.005
Warning: log write time 540ms, size 89KB
*** 2013-08-14 08:25:54.390
error 29740 detected in background process
ORA-29740: evicted by member , group incarnation
Only a small excerpt of the lgwr trace is shown here, but it is enough to see log write timeouts in the 03:25–03:59 window in the early morning. Comparing across the days, the lgwr write warnings are in fact at their worst around 03:25 every day, which means that is when the system's I/O pressure peaks. Between 03:59 and 08:25, however, nothing else was logged, so the I/O pressure in that window should not have been high — although that by itself does not prove that the overall system load was low. (A quick way to cross-check redo-write latency from inside the database is sketched below.)
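(As a rough cross-check of redo-write latency — purely illustrative, using the standard event histogram view rather than anything taken from this case's traces — the distribution of LGWR's physical write times can be pulled like this:)

-- Cumulative distribution of LGWR redo writes by latency bucket since instance startup
SELECT event,
       wait_time_milli,   -- upper bound of the latency bucket, in milliseconds
       wait_count
FROM   v$event_histogram
WHERE  event = 'log file parallel write'
ORDER  BY wait_time_milli;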
Next we continue with the lmon process trace. The contents of ngcrmdb1_lmon_6627610.trc are as follows:
Unix process pid: 6627610, image: oracle@ngcrmdb1 (LMON)
*** SERVICE NAME:() 2013-08-14 08:27:21.684
*** SESSION ID:(13202.5) 2013-08-14 08:27:21.684
GES resources 242662 pool 56
GES enqueues 378762
GES IPC: Receivers 15 Senders 15
GES IPC: Buffers Receive 1000 Send (i:24750 b:1913) Reserve 1000
GES IPC: Msg Size Regular 408 Batch 8192
Batching factor: enqueue replay 201, ack 224
Batching factor: cache replay 126 size per lock 64
kjxggin: receive buffer size = 32768
high load threshold = 200704                     ++++ the load is high, already over the threshold
*** 2013-08-14 08:27:26.865
kjxgmrcfg: Reconfiguration started, reason 1     +++++ the instance performs a reconfiguration, meaning the crm1 node was restarted at this point
kjxgmcs: Setting state to 0 0.                   +++++ state 0 means the node does not belong to the cluster
*** 2013-08-14 08:27:26.867
 Name Service frozen
kjxgmcs: Setting state to 0 1.                   +++++ the node is rejoined to the cluster and the state changes to 1 (meaning it is being monitored)
kjxgrssvote: reconfig bitmap chksum 0xfa98ff9 cnt 2 master 0 ret 0
kjfcpiora: published my fusion master weight 7699028
kjfcpiora: published my enqueue weight 242662
kjfcpiora: publish my flogb 16
kjfcpiora: publish my cluster_database_instances parameter = 2
kjxggpoll: change poll time to 50 ms
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 10 2.
kjfmuin: bitmap 0 1
kjfmmhi: received msg from 0 (inc 10)
kjfmmhi: received msg from 1 (inc 4)
 Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 10 3.
 Name Service recovery started
 Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 10 4.
 Multicasted all local name entries for publish
 Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 10 5.
 Name Service normal
 Name Service recovery done
*** 2013-08-14 08:27:27.464
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 10 6.
*** 2013-08-14 08:27:27.565
*** 2013-08-14 08:27:27.565
Reconfiguration started (old inc 0, new inc 10)
Dynamic remastering is disabled
Synchronization timeout interval: 900 sec
List of nodes: 0 1
*** 2013-08-14 08:27:27.565
kjxggpoll: change poll time to 600 ms
 Global Resource Directory frozen
 node 0
 release 10 2 0 4
 node 1
 release 10 2 0 4
 number of mastership buckets = 128
。。。。。。
 Post SMON to start 1st pass IR
 Fix write in gcs resources
 sent syncr inc 10 lvl 8 to 0 (10,24/0/0)
 sent synca inc 10 lvl 8 (10,24/0/0)
*** 2013-08-14 08:27:36.170                      +++++ reconfig complete
Reconfiguration complete
* domain 0 valid?: 1
*** 2013-08-14 08:27:41.581
kjxgrtmc2: mounting member 0 thread 1            +++++ mount thread (crm1)
From this trace, the reconfiguration started at 2013-08-14 08:27:26 and was complete by 2013-08-14 08:27:41, when the thread was mounted — roughly 15 seconds in all. Judging from the information above, the lms3 process itself was under heavy pressure; note that the "load" here refers to that single process, not to the system as a whole. In RAC, the lmon process is mainly responsible for monitoring the cluster's global enqueues and resources and for performing global enqueue recovery. After an instance terminates abnormally, lmon handles the GCS memory cleanup, and whenever an instance joins or leaves the cluster, lmon reconfigures the locks and resources.

In addition, lmon performs communication checks between the different instances in the cluster; if it finds that the peer's communication has timed out, it issues a node eviction and drives that node out.

From all of the above, it was indeed the lms3 inter-instance communication timing out that made lmon issue the eviction and restart the instance on the crm1 node.
Another trace, ngcrmdb1_lmon_15573208.trc, records more detail about the lms processes:
*** 2013-07-28 07:12:17.257
Reconfiguration complete
* domain 0 valid?: 1
*** 2013-08-14 08:20:30.089
kjfmrcvrchk: receiver LMS[3] check timed out (1376439430.1376439630.0).   +++ lms3 is found to have timed out
kjfmrcvrchk: Dumping callstack of lms3
Submitting asynchronized dump request [20]
*** 2013-08-14 08:23:29.891
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3                  +++ reconfig protection kicks in, trying to avoid a restart
kjxgrrcfgchk: Initiating reconfig, reason 3
kjxgrrcfgchk: Increased trace level for comm reconfig from 0 to 10
kjxgrrcfgchk: prev pstate 5 mapsz 512
kjxgrrcfgchk: new bmp: 0 1
kjxgrrcfgchk: work bmp: 0 1
kjxgrrcfgchk: rr bmp: 0 1
kjxgrrcfgchk: the instance is NOT in good health.
*** 2013-08-14 08:23:30.496
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3                  +++ reason 3 means node communication within the cluster failed
*** 2013-08-14 08:23:31.100
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
.......
.......
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:37.161
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:23:37.162
kjxgrmsghndlr: Recvd msg 1104e60a0 type 5 from 1, seq (4,5)
*** 2013-08-14 08:23:37.162
Received heartbeat ack from 1
*** 2013-08-14 08:23:37.767
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:38.391
.......
.......
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:42.040
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:23:42.041
kjxgrmsghndlr: Recvd msg 1104e6640 type 5 from 1, seq (4,5)
*** 2013-08-14 08:23:42.041
Received heartbeat ack from 1
*** 2013-08-14 08:23:42.646
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:43.291
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:43.897
.......
.......
*** 2013-08-14 08:25:14.505
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:25:14.506
kjxgrmsghndlr: Recvd msg 1104eda80 type 5 from 1, seq (4,5)
*** 2013-08-14 08:25:14.506
Received heartbeat ack from 1
*** 2013-08-14 08:25:15.110
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
.......
.......
*** 2013-08-14 08:25:19.400
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:25:19.400
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:25:19.402
kjxgrmsghndlr: Recvd msg 1104ee020 type 5 from 1, seq (4,5)
*** 2013-08-14 08:25:19.402
Received heartbeat ack from 1
*** 2013-08-14 08:25:20.005
kjxgrdtrt: Evicted by 1, seq (6, 4)
IMR state information
  Member 0, thread 1, state 0x4:4, flags 0x0040
  RR seq commit 4 cur 4
  Propstate 5 prv 4 pending 0
  system load 3402 (normal)
Member information:
  Member 0, incarn 2, version 0x36b98a7, thrd 1
   prev thrd 65535, status 0x0007 (JRM.), err 0x0000
  Member 1, incarn 4, version 0x370baab, thrd 2
   prev thrd 65535, status 0x0007 (JRM.), err 0x0000
From this trace we can put together the following timeline:
2013-08-14 08:20:30 ---- the check finds that lms3 has timed out
2013-08-14 08:23:29 ---- the trace level is raised and reconfig detection kicks in, trying to avoid a restart
2013-08-14 08:25:20 ---- the eviction request is issued
We already know that the instance was restarted because the lms3 processes of the two instances timed out talking to each other. But was the load on the crm system really that high at that point in time? Let's check against the data, starting with the diag information dumped for the lms3 process:
*** 2013-08-14 08:25:27.917
Dumping diagnostic information for ospid 14856422:
OS pid = 14856422
loadavg : 2.49 3.72 5.99
swap info: free_mem = 7493.79M rsv = 256.00M
           alloc = 2105.66M avail = 65536.00M swap_free = 63430.34M
      F S      UID      PID  PPID   C PRI NI ADDR       SZ      WCHAN  STIME  TTY  TIME     CMD
 240103 A   oracle 14856422     1   4  39 -- 1f059f6510 110936         Jul 28   -  1414:58  ora_lms3_ngcrmdb1
open: The file access permissions do not allow the specified action.
We can see that the process was last started on July 28 and that its PRI value is 39; the load averages reported at that moment are 2.49, 3.72 and 5.99.
From this, the operating system load does not look high. Grepping the same trace shows that the loadavg values recorded around the other lms process dumps are all much the same:
[root@killdb ~]# cat ngcrmdb1_lmon_15573208.trc |grep 'loadavg :'
loadavg : 2.21 3.71 6.01
loadavg : 2.35 3.72 6.00
loadavg : 2.35 3.72 6.00
loadavg : 2.49 3.72 5.99
loadavg : 2.49 3.72 5.99
loadavg : 2.49 3.72 5.99
loadavg : 2.85 3.78 6.00
loadavg : 2.85 3.78 6.00
loadavg : 2.86 3.76 5.98
loadavg : 2.86 3.76 5.98
loadavg : 2.86 3.76 5.98
loadavg : 2.79 3.74 5.96
loadavg : 2.79 3.74 5.96
loadavg : 2.81 3.72 5.94
loadavg : 2.81 3.72 5.94
loadavg : 2.81 3.72 5.94
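(For completeness: when the instance is reachable, the same OS load figures can also be sampled from inside the database. A minimal sketch against the standard v$osstat view — not something taken from this case's traces, and the 'LOAD' statistic is only populated on platforms that expose it:)

-- Current OS load average and CPU count as seen by the instance
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('LOAD', 'NUM_CPUS');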
At this point we can see that the load around the lms3 process was not actually high, so why did it wait long enough to time out — to the point that the lms3 communication between the RAC instances failed and the instance was evicted and restarted? What exactly was lms3 waiting for? Let's look at the node 1 lms3 process information:
We can see that lms3 had been waiting for the cache buffers chains latch, and the wait was long enough to time out. Normally a cbc latch is held only very briefly, typically at the millisecond level, yet here the wait ran into hundreds of seconds. The resource lms3 was requesting, 70000150072db50, was held by orapid 1386, so let's see what process orapid 1386 is:

    SO: 70000154841a800, type: 2, owner: 0, flag: INIT/-/-/0x00
    (process) Oracle pid=1386, calls cur/top: 70000146b345730/700001469390a10, flag: (0) -
              int error: 0, call error: 0, sess error: 0, txn error 0
    (post info) last post received: 109 0 4
                last post received-location: kslpsr
                last process to post me: 7000015453a5098 1 6
                last post sent: 0 0 24
                last post sent-location: ksasnd
                last process posted by me: 7000015453a5098 1 6
    (latch info) wait_event=0 bits=2
        holding    (efd=16) 70000150072db50 Child cache buffers chains level=1 child#=476313
        Location from where latch is held: kcbgtcr: kslbegin excl:
        Context saved from call: 557847035
        state=busy(exclusive) (val=0x200000000000056a) holder orapid = 1386
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         1594 (1051, 1376439951, 2)
         53 (1033, 1376439951, 2)
         98 (1032, 1376439951, 2)
         1502 (1011, 1376439951, 2)
         69 (1009, 1376439951, 2)
         124 (960, 1376439951, 2)
         213 (950, 1376439951, 2)
         191 (948, 1376439951, 2)
         75 (712, 1376439951, 2)
         1359 (695, 1376439951, 2)
         424 (650, 1376439951, 2)
         1572 (636, 1376439951, 2)
         89 (591, 1376439951, 2)
         5168 (588, 1376439951, 2)
         33 (521, 1376439951, 2)          ++++ 33 is the lms3 process
         1296 (394, 1376439951, 2)
         4049 (334, 1376439951, 2)
         163 (333, 1376439951, 2)
         4437 (91, 1376439951, 2)
         4459 (32, 1376439951, 2)
         waiter count=20
    Process Group: DEFAULT, pseudo proc: 7000015446c94e8
    O/S info: user: oracle, term: UNKNOWN, ospid: 7504228
    OSD pid info: Unix process pid: 7504228, image: oracle@ngcrmdb1
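(On a live system the same picture — which session currently holds a cache buffers chains child latch and who is sleeping on it — can be approximated from the data dictionary instead of a process/systemstate dump. A rough sketch with the standard views; in 10g the waiters show up under the 'latch: cache buffers chains' event:)

-- Sessions currently holding a cache buffers chains child latch, with their OS pid and SQL
SELECT lh.laddr      AS child_latch_addr,
       lh.name       AS latch_name,
       s.sid,
       p.spid        AS os_pid,
       s.sql_hash_value
FROM   v$latchholder lh
       JOIN v$session s ON s.sid  = lh.sid
       JOIN v$process p ON p.addr = s.paddr
WHERE  lh.name = 'cache buffers chains';

-- Sessions sleeping on a cbc latch; p1raw is the address of the child latch they want
SELECT sid, event, p1raw AS child_latch_addr, seconds_in_wait
FROM   v$session_wait
WHERE  event = 'latch: cache buffers chains';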
It is clear that the lms3 process was indeed waiting on the pid 1386 process: lms3 wanted a cbc latch, and that latch was held by pid 1386 and never released. Nor was lms3 alone — another 19 processes were queued behind the same latch. The wait history of the pid 1386 process looks like this:

    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=1051
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=340
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=149098
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=269
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=318
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=378
                buffer#=6e6d, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1031
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0
    ……….
From the information above, the orapid 1386 session logged in from 10.200.141.139, host name ngibweb3. Its wait event is 'SQL*Net message from client', and yet it had been holding the cbc latch for a long time without releasing it. For a process to hold a cbc latch for that long, there are essentially a few possibilities:

1. The process is hung;
2. An Oracle bug, or the operating system itself is hung.

The lms3 process had been waiting for more than 5 minutes while the cbc resource stayed in this process's hands, so it looks like the session's process may well have hung. How, then, can we confirm that the process was probably hung?
*** 2013-08-14 08:06:53.459
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.49 9.52 8.51
swap info: free_mem = 9215.18M rsv = 256.00M
           alloc = 2086.34M avail = 65536.00M swap_free = 63449.66M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN            STIME TTY TIME    CMD
 240001 A   oracle 7504228     1  32  76 20 1af21a7510 96324  f10006000ae020c8 Jul 28  -  568:26  oraclengcrmdb1 (LOCAL=NO)
...........
...........
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368ee48, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 3, prv: 0, sql: 70000145c8bbba0, psql: 70000145c969a38, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=41178 wait_time=325 seconds since wait started=0
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=325
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=2148096
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=4
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=345
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=386
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=482
                buffer#=693f, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1267
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0

*** 2013-08-14 08:08:13.068
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.51 9.48 8.57
swap info: free_mem = 9158.54M rsv = 256.00M
           alloc = 2086.33M avail = 65536.00M swap_free = 63449.67M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME TTY TIME    CMD
 240001 A   oracle 7504228     1  33  76 20 1af21a7510 98492         Jul 28  -  568:29  oraclengcrmdb1 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/7504228/ctl): The requested resource is busy.
7504228: oraclengcrmdb1 (LOCAL=NO)
0x0000000000000000  ????????() + ??
........
........
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368ee48, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 3, prv: 0, sql: 70000145ac1ef50, psql: 70000145db36f00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'gc current request' blocking sess=0x0 seq=42098 wait_time=432 seconds since wait started=0
                file#=23f, block#=fe61a, id#=2000001
    Dumping Session Wait History
     for 'gc current block 2-way' count=1 wait_time=432
                =23f, =fe61a, =2000001
     for 'gc current block 2-way' count=1 wait_time=305
                =1be, =64f8, =2000001
     for 'gc cr block 2-way' count=1 wait_time=423
                =1be, =64f8, =1
     for 'gc current block 2-way' count=1 wait_time=482
                =2b3, =24e03, =2000001
     for 'gc current block 2-way' count=1 wait_time=327
                =1be, =64b7, =2000001
     for 'gc cr block 2-way' count=1 wait_time=413
                =1be, =64b7, =1
     for 'gc current block 2-way' count=1 wait_time=366
                =23f, =fe681, =2000001
     for 'gc current block 2-way' count=1 wait_time=418
                =1be, =64a9, =2000001
     for 'gc cr block 2-way' count=1 wait_time=453
                =1be, =64a9, =1
     for 'gc current block 2-way' count=1 wait_time=454
                =1b4, =fa1c2, =2000001
    temporary object counter: 0

*** 2013-08-14 08:08:17.929
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.63 9.50 8.59
swap info: free_mem = 9263.69M rsv = 256.00M
           alloc = 2086.33M avail = 65536.00M swap_free = 63449.67M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME TTY TIME    CMD
 240001 A   oracle 7504228     1  41  80 20 1af21a7510 96448         Jul 28  -  568:31  oraclengcrmdb1 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/7504228/ctl): The requested resource is busy.
7504228: oraclengcrmdb1 (LOCAL=NO)
0x0000000000000000  ????????() + ??
*** 2013-08-14 08:08:18.810
...........
...........
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=0
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=340
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=149098
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=269
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=318
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=378
                buffer#=6e6d, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1031
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0

*** 2013-08-14 08:25:49.363
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 2.81 3.72 5.94
swap info: free_mem = 7515.38M rsv = 256.00M
           alloc = 2106.36M avail = 65536.00M swap_free = 63429.64M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME TTY TIME    CMD
 240001 T   oracle 7504228     1   0  60 20 1af21a7510 96324         Jul 28  -  568:31  oraclengcrmdb1 (LOCAL=NO)
.......
.......
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=1051
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=340
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=149098
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=269
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=318
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=378
                buffer#=6e6d, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1031
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0
We can see that at the last two timestamps, 2013-08-14 08:08:18 and 2013-08-14 08:25:49, the TIME value shown for this process in the ps output did not change at all, the session stayed busy, and in both dumps the SQL being executed was 70000145dad36e8. So what statement is 70000145dad36e8?
Cursor#6(1104a1958) state=BOUND curiob=1105400e8
 curflg=cd fl2=0 par=0 ses=700001542fc0208
 sqltxt(70000145dad36e8)=DELETE /*+ index(A IDX_TI_B_MCAS_UDR_3) */ FROM TI_B_MCAS_UDR A
   WHERE DECODE(A.SERV_TYPE,'VGOP',A.SERV_TYPE||A.BIZ_TYPE_CODE,A.SERV_TYPE) NOT IN ('DSMP',
   '1258','CMMB','VGOP60','VGOP10','NIAP','RINP','STMM','PIMM','MGIS','PPOC','MCSP','MDOP','MBLG','SURF','NFCP','MCLD','MEBP')
 hash=048fa56d3419181a96e347d96053b942
 parent=70000145dab3218 maxchild=03 plk=70000146b2e2208 ppn=n
cursor instantiation=1105400e8 used=1376438898
 child#2(7000011f63f85e0) pcs=7000012b5494ea8
  clk=7000013ac106198 ci=7000011ddc3d0b0
  pn=70000146bb6b560 ctx=70000140e841ab8
 kgsccflg=0 llk[1105400f0,1105400f0] idx=0
 xscflg=c0100276 fl2=c000400 fl3=228208c fl4=100
Frames pfr 110540080 siz=11728 efr 11053ff90 siz=11672
 Cursor frame dump
  enxt: 4.0x00000480  enxt: 3.0x00000af8  enxt: 2.0x00000ec8  enxt: 1.0x00000f58
  pnxt: 2.0x00000008  pnxt: 1.0x00000030
 kxscphp  1105384e8  siz=984  inu=344  nps=344
We can see that the process had been running a delete from 8:08 all the way to 8:25, about 17 minutes, and from the trace information the process had indeed been stuck for 1051 seconds.
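(As an aside, on a running system the question "which statement is this session executing right now?" can also be answered straight from the dictionary, without dumping the process. A minimal sketch — sid 11786 is taken from the dump above, everything else is standard:)

-- Current statement of session 11786, resolved through the cursor address and hash value
SELECT s.sid, s.sql_address, s.sql_hash_value,
       q.sql_id, q.plan_hash_value, q.sql_text
FROM   v$session s
       JOIN v$sql q ON  q.address    = s.sql_address
                    AND q.hash_value = s.sql_hash_value
WHERE  s.sid = 11786;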
Finally, let's confirm from the process's own trace what this session was doing. The information is as follows:
Dump file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:    AIX
Node name:      ngcrmdb1
Release:        3
Version:        5
Machine:        00C8A6F24C00
Instance name: ngcrmdb1
Redo thread mounted by this instance: 1
Oracle process number: 1386
Unix process pid: 7504228, image: oracle@ngcrmdb1

*** 2013-08-14 08:25:51.146
*** ACTION NAME:() 2013-08-14 08:25:51.119
*** MODULE NAME:(JDBC Thin Client) 2013-08-14 08:25:51.119
*** SERVICE NAME:(ngcrmdb) 2013-08-14 08:25:51.119
*** SESSION ID:(11786.20) 2013-08-14 08:25:51.119
Received ORADEBUG command 'dump errorstack 3' from process Unix process pid: 5886552, image:
*** 2013-08-14 08:25:51.146
ksedmp: internal or fatal error
Current SQL statement for this session:
DELETE /*+ index(A IDX_TI_B_MCAS_UDR_3) */ FROM TI_B_MCAS_UDR A WHERE DECODE(A.SERV_TYPE,'VGOP',A.SERV_TYPE||A.BIZ_TYPE_CODE,A.SERV_TYPE) NOT IN ('DSMP',
'1258','CMMB','VGOP60','VGOP10','NIAP','RINP','STMM','PIMM','MGIS','PPOC','MCSP','MDOP','MBLG','SURF','NFCP','MCLD','MEBP')
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
700001459e79290        66  procedure UOP_CRM1.P_UIP_MERGER_MCASDATA
70000145c91ebc8         1  anonymous block
The process trace agrees with our earlier analysis: the session was executing a delete. The execution plan of that SQL is shown below:
============
Plan Table
============
-----------------------------------------------------------+-----------------------------------+
| Id | Operation                     | Name                | Rows  | Bytes | Cost | Time       |
-----------------------------------------------------------+-----------------------------------+
| 0  | DELETE STATEMENT              |                     |       |       |  39K |            |
| 1  |  DELETE                       | TI_B_MCAS_UDR       |       |       |      |            |
| 2  |   TABLE ACCESS BY INDEX ROWID | TI_B_MCAS_UDR       |     1 |    47 |  39K |  00:08:05  |
| 3  |    INDEX FULL SCAN            | IDX_TI_B_MCAS_UDR_3 |  169K |       | 1505 |  00:00:19  |
-----------------------------------------------------------+-----------------------------------+
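(For reference, on a running 10g system the same plan can be pulled with DBMS_XPLAN once the sql_id is known, for example from the query shown earlier; the sql_id below is only a placeholder, not a value taken from this case:)

-- Show the runtime plan of a cursor that is still present in the shared pool
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('placeholder_sql_id', 0, 'TYPICAL'));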
We can see that the execution of this SQL mainly involves an INDEX FULL SCAN. Keep in mind that an index full scan is performed with single-block reads, and because this is a RAC environment Oracle has to ship blocks between the instances to guarantee consistency as part of cache fusion — work that is largely carried out by the lms processes. Yet this process kept holding the cbc latch without releasing it, most likely because the session itself had already hung.

In the end, node 1's lms3 process could not obtain the cbc latch for so long that it timed out; the lmon check detected this, and lmon finally restarted the instance.