Windows Oracle RAC instance evicted due to LCK0 hang
A reader reported that a Windows RAC system of theirs kept running into instance evictions. The node 1 alert log shows the following:
Mon Aug 22 10:48:39 2022
Archived Log entry 411592 added for thread 1 sequence 170394 ID 0x5fe72d0b dest 1:
Mon Aug 22 10:55:11 2022
IPC Send timeout detected. Sender: ospid 12312 [ORACLE.EXE (LMS5)]
Receiver: inst 2 binc 61425361 ospid 9432
Mon Aug 22 10:55:11 2022
IPC Send timeout detected. Sender: ospid 13620 [ORACLE.EXE (LMS0)]
Receiver: inst 2 binc 61425340 ospid 6296
Mon Aug 22 10:55:11 2022
IPC Send timeout detected. Sender: ospid 7112 [ORACLE.EXE (LMS2)]
Receiver: inst 2 binc 61425357 ospid 10880
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms2_7112.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.3 inc 4 for msg type 34 from opid 15
Mon Aug 22 10:55:11 2022
IPC Send timeout detected. Sender: ospid 7312 [ORACLE.EXE (LMS4)]
Receiver: inst 2 binc 61425357 ospid 8140
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms4_7312.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.5 inc 4 for msg type 34 from opid 17
Mon Aug 22 10:55:12 2022
Communications reconfiguration: instance_number 2
Mon Aug 22 10:55:12 2022
IPC Send timeout detected. Sender: ospid 2416 [ORACLE.EXE (LMS3)]
Receiver: inst 2 binc 61425357 ospid 8068
Mon Aug 22 10:55:12 2022
IPC Send timeout detected. Sender: ospid 16104 [ORACLE.EXE (LMS1)]
Receiver: inst 2 binc 61425341 ospid 10436
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms1_16104.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.2 inc 4 for msg type 65518 from opid 14
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms5_12312.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.6 inc 4 for msg type 65518 from opid 18
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms0_13620.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.1 inc 4 for msg type 34 from opid 13
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms3_2416.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.4 inc 4 for msg type 65518 from opid 16
Mon Aug 22 10:55:19 2022
IPC Send timeout detected. Receiver ospid 12312 [
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lms5_12312.trc:
Mon Aug 22 10:55:25 2022
Detected an inconsistent instance membership by instance 1
Evicting instance 2 from cluster
Waiting for instances to leave: 2
Dumping diagnostic data in directory=[cdmp_20220822105529], requested by (instance=2, osid=1748 (LMD0)), summary=[abnormal instance termination].
Mon Aug 22 10:55:42 2022
IPC Send timeout detected. Sender: ospid 12484 [ORACLE.EXE (LMD0)]
Receiver: inst 2 binc 61425219 ospid 1748
Mon Aug 22 10:55:44 2022
Remote instance kill is issued with system inc 6
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Reconfiguration started (old inc 4, new inc 8)
List of instances:
 1 (myinst: 1)
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl1\trace\orcl1_lmd0_12484.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 2.0 inc 4 for msg type 65521 from opid 12
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
The alert log of the problem node (instance 2) makes the picture very clear, and on the surface it looks like a very common scenario:
Mon Aug 22 10:46:07 2022
Archived Log entry 411588 added for thread 2 sequence 18127 ID 0x5fe72d0b dest 1:
Mon Aug 22 10:54:44 2022
minact-scn master exiting with err:12751
Mon Aug 22 10:55:12 2022
IPC Send timeout detected. Receiver ospid 9432 [
Mon Aug 22 10:55:12 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_lms5_9432.trc:
IPC Send timeout detected. Receiver ospid 6296 [
Mon Aug 22 10:55:16 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_lms0_6296.trc:
IPC Send timeout detected. Sender: ospid 9432 [ORACLE.EXE (LMS5)]
Receiver: inst 1 binc 1098997858 ospid 12312
IPC Send timeout detected. Receiver ospid 8068 [
Mon Aug 22 10:55:20 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_lms3_8068.trc:
Mon Aug 22 10:55:24 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_lms5_9432.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation: IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
IPC Send timeout to 1.6 inc 4 for msg type 65521 from opid 18
Mon Aug 22 10:55:24 2022
Suppressed nested communications reconfiguration: instance_number 1
Detected an inconsistent instance membership by instance 1
Mon Aug 22 10:55:29 2022
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
Mon Aug 22 10:55:29 2022
System state dump requested by (instance=2, osid=1748 (LMD0)), summary=[abnormal instance termination].
System State dumped to trace file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_diag_10524_20220822105529.trc
LMD0 (ospid: 1748): terminating the instance due to error 481
Mon Aug 22 10:55:29 2022
This is a classic IPC send timeout: communication errors follow, and the episode ends with an instance member kill.
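When IPC send timeouts like these show up, a usual first sanity check is which network each instance has actually registered as the cluster interconnect. A minimal sketch using the standard GV$ view (output shown is whatever each node reports):

-- Which interface each instance is using for the RAC interconnect
SELECT inst_id, name, ip_address, is_public, source
  FROM gv$cluster_interconnects
 ORDER BY inst_id;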
Analyzing the node 1 LGWR trace also turns up error 10054, consistent with the ORA-27300 errors in the alert log above:
*** 2022-08-22 10:49:49.484
IPCSendMsg: could not initiate send on conn 0x7b561750 to node [orc : 2892 : 10880 : 614253390], err 10054
IPCGetRequestInfo: failed a request rqh(0xa8564458), type(6), status(2), bytes(0)
IPCSendMsg: could not initiate send on conn 0x7b561508 to node [orc : 2892 : 6296 : 614253218], err 10054
IPCGetRequestInfo: failed a request rqh(0xa8564168), type(6), status(2), bytes(0)
*** 2022-08-22 10:55:24.993
Warning: log write elapsed time 1726ms, size 1KB
*** 2022-08-22 10:55:53.159
2022-08-22 10:55:53.159658 : GSIPC:KSXPCB: msg 0x000000868CA4F228 status 34, type 73, dest 2, rcvr 1
2022-08-22 10:55:53.159850 : GSIPC:KSXPCB: msg 0x000000868D3361E8 status 34, type 73, dest 2, rcvr 1
2022-08-22 10:55:53.160040 : GSIPC:KSXPCB: msg 0x000000868D0CEA90 status 34, type 73, dest 2, rcvr 1
2022-08-22 10:55:53.160088 : GSIPC:KSXPCB: msg 0x000000868D1E8818 status 34, type 73, dest 2, rcvr 1
2022-08-22 10:55:53.160128 : GSIPC:KSXPCB: msg 0x000000868C9E8348 status 34, type 73, dest 2, rcvr 1
2022-08-22 10:55:53.160170 : GSIPC:KSXPCB: msg 0x000000868D224C00 status 34, type 73, dest 2, rcvr 1
2022-08-22 10:55:53.160326 : GSIPC:KSXPCB: msg 0x000000868CD50620 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160369 : GSIPC:KSXPCB: msg 0x000000868D07B1B0 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160408 : GSIPC:KSXPCB: msg 0x000000868D30DDB0 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160447 : GSIPC:KSXPCB: msg 0x000000868D392268 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160483 : GSIPC:KSXPCB: msg 0x000000868CEB4660 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160524 : GSIPC:KSXPCB: msg 0x000000868D2574E8 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160564 : GSIPC:KSXPCB: msg 0x000000868C9AAC00 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160605 : GSIPC:KSXPCB: msg 0x000000868CEF79B0 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160647 : GSIPC:KSXPCB: msg 0x000000868CA16D38 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160687 : GSIPC:KSXPCB: msg 0x000000868CD83D90 status 34, type 73, dest 2, rcvr 3
2022-08-22 10:55:53.160725 : GSIPC:KSXPCB: msg 0x000000868CCEFCF8 status 34, type 73, dest 2, rcvr 3
*** 2022-08-22 10:55:53.160
Warning: log write broadcast wait time 27115ms (SCN 0xd.70ce1b5c)
*** 2022-08-22 10:55:53.160
Warning: log write broadcast wait time 25081ms (SCN 0xd.70ce1b5f)
Digging further into the node 2 traces shows that the LCK0 process was behaving abnormally around the failure window and was effectively hung:
*** 2022-08-22 10:50:02.433
==============================
LCK0 (ospid: 9164) has not moved for 35 sec (1661136601.1661136566)
kjfmGCR_HBCheckAll: LCK0 (ospid: 9164) has status 2
  : waiting for event 'libcache interrupt action by LCK' for 16 secs with wait_id 13822186.
===[ Wait Chain ]===
Wait chain is empty.
kjgcr_SlaveReqBegin: message queued to slave
Working back from that timestamp, LCK0 had actually stopped making progress at around 10:49:27, and had been waiting on 'libcache interrupt action by LCK' since roughly 10:49:46.
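The two numbers in parentheses after "has not moved for 35 sec" read as Unix epoch seconds (the time of the GCR heartbeat check and the last time LCK0 moved), which is where the estimate above comes from. A quick sketch of the conversion, assuming the server clock is in UTC+8:

-- 1661136601 = time of the GCR check, 1661136566 = last time LCK0 made progress
SELECT 1661136601 - 1661136566 AS seconds_not_moved,
       TO_CHAR((TIMESTAMP '1970-01-01 00:00:00 +00:00'
                + NUMTODSINTERVAL(1661136566, 'SECOND')) AT TIME ZONE '+08:00',
               'YYYY-MM-DD HH24:MI:SS') AS lck0_last_moved
  FROM dual;
-- seconds_not_moved = 35, lck0_last_moved = 2022-08-22 10:49:26 (UTC+8 assumed)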
The diag dump shows that this process was waiting on an IV lock:
PROCESS 49: LCK0
  ----------------------------------------
  SO: 0x00000042D11C4B90, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
   proc=0x00000042D11C4B90, name=process, file=ksu.h LINE:13008, pg=0
  (process) Oracle pid:49, ser:2, calls cur/top: 0x000000403E2667A0/0x000000403E2667A0
            flags : (0x6) SYSTEM
            flags2: (0x0),  flags3: (0x10)
            intr error: 0, call error: 0, sess error: 0, txn error 0
            intr queue: empty
  ksudlp FALSE at location: 0
  (post info) last post received: 0 0 22
              last post received-location: ksb2.h LINE:857 ID:ksbria
              last process to post me: 0x42d11c4b90 2 6
              last post sent: 0 0 22
              last post sent-location: ksb2.h LINE:857 ID:ksbria
              last process posted by me: 0x42d11c4b90 2 6
  (latch info) wait_event=0 bits=0x0
    Process Group: DEFAULT, pseudo proc: 0x00000042D1472028
    O/S info: user: SYSTEM, term: RAC2, ospid: 9164
    OSD pid info: Windows thread id: 9164, image: ORACLE.EXE (LCK0)
    08-22-2022 10:55:34.145
    ......
    1: waiting for 'enq: IV - contention'
        type|mode=0x49560003, id1=0x53594e43, id2=0x3b
        wait_id=13822187 seq_num=54482 snap_id=1
        wait times: snap=5 min 47 sec, exc=5 min 47 sec, total=5 min 47 sec
        wait times: max=infinite, heur=5 min 47 sec
        wait counts: calls=810 os=810
        in_wait=1 iflags=0x1522
    0: waiting for 'libcache interrupt action by LCK'
        location=0x1, =0x0, =0x0
        wait_id=13822186 seq_num=54481 snap_id=1
        wait times: snap=0.000000 sec, exc=0.000271 sec, total=5 min 47 sec
        wait times: max=infinite, heur=5 min 47 sec
        wait counts: calls=0 os=0
        in_wait=1 iflags=0x15a2
Counting back 5 minutes 47 seconds from the dump timestamp puts the start of the 'enq: IV - contention' wait at roughly 10:49:47, right around the time LCK0 stopped making progress. For this wait event, the usual suspects are things like a mismatch in CPU count between the RAC nodes, which leads to inconsistent GES configuration across the instances and triggers the problem; MOS also documents a few related bugs.
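Ruling the CPU-count theory in or out is straightforward; a sketch using standard GV$ views (values are whatever each node reports):

-- Compare CPU-related settings across the RAC instances
SELECT inst_id, value AS cpu_count
  FROM gv$parameter
 WHERE name = 'cpu_count'
 ORDER BY inst_id;

SELECT inst_id, stat_name, value
  FROM gv$osstat
 WHERE stat_name IN ('NUM_CPUS', 'NUM_CPU_CORES', 'NUM_CPU_SOCKETS')
 ORDER BY inst_id, stat_name;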
In this case, however, it was confirmed that the CPU counts on the two nodes match. Meanwhile, the diag trace of the surviving node shows a growing number of hung sessions after 10:50, presumably another knock-on effect of the LCK0 hang. The process's wait history shows the p1/p2 values of the wait:
[115 samples, 10:53:33 - 10:55:29]
    waited for 'enq: IV - contention', seq_num: 54482
        p1: 'type|mode'=0x49560003
        p2: 'id1'=0x53594e43
        p3: 'id2'=0x3b
    time_waited: >= 115 sec (still in wait)
Here p1 encodes the lock type and mode, and p2 (id1) encodes the reason for the lock. Decoding the hex values, this roughly reads as synchronizing a library cache object invalidation across instances.
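The hex decode can be done directly in SQL; a small sketch using the literal values from the trace above:

-- p1 'type|mode' = 0x49560003: bytes 0x49,0x56 spell the lock type 'IV', the low 16 bits are mode 3
-- p2 'id1'       = 0x53594e43: the ASCII string 'SYNC'
SELECT CHR(TO_NUMBER('49', 'XX')) || CHR(TO_NUMBER('56', 'XX')) AS lock_type,
       MOD(TO_NUMBER('49560003', 'XXXXXXXX'), 65536)            AS lock_mode,
       UTL_RAW.CAST_TO_VARCHAR2(HEXTORAW('53594E43'))           AS id1_ascii
  FROM dual;
-- lock_type = 'IV', lock_mode = 3, id1_ascii = 'SYNC'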
This makes me suspect that some DDL (or a similar operation that invalidates library cache objects) was running in that window and triggered the problem.
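A simple first pass at that suspicion is to look for objects whose DDL timestamp falls inside the failure window; note that last_ddl_time is also bumped by grants and similar operations, so this is only a rough filter:

-- Objects touched by DDL around the time of the hang
SELECT owner, object_name, object_type, last_ddl_time
  FROM dba_objects
 WHERE last_ddl_time BETWEEN TO_DATE('2022-08-22 10:45:00', 'YYYY-MM-DD HH24:MI:SS')
                         AND TO_DATE('2022-08-22 11:00:00', 'YYYY-MM-DD HH24:MI:SS')
 ORDER BY last_ddl_time;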
Given that the system is not under heavy load and the hardware is far from weak, with 200-500 sessions per instance and all-flash storage, I think a bug is the more likely explanation.
MOS does mention a bug of this kind, said to be fixed in 11.2.0.4 and 12.2, but in my view the problem still exists in 11.2.0.4.