lms进程异常 导致实例重启一例
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: lms进程异常 导致实例重启一例
1 2 |
下午公司内部技术部培训时,同事接到某客户的电话,某rac的其中一个接到重启了, 希望让我们帮忙分析一下具体原因,同事把相关的日志也转发给我了,简单的分析了一下, |
1 |
alert log如下: |
1 2 3 4 5 6 7 8 |
Mon Jul 11 12:47:14 2011 Thread 2 advanced to log sequence 53143 Current log# 3 seq# 53143 mem# 0: /dev/rdb_redo21_500m Mon Jul 11 12:47:54 2011 ARC3: Standby redo logfile selected for thread 2 sequence 53142 for destination LOG_ARCHIVE_DEST_2 Mon Jul 11 12:52:48 2011 Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc: <strong>ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530],</strong> |
1 2 3 4 |
<strong> [], [], [], []</strong> Mon Jul 11 12:52:51 2011 Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc: ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], |
1 2 3 4 5 6 |
[], [], [], [] Mon Jul 11 12:52:52 2011 Trace dumping is performing id=[cdmp_20110711125251] Mon Jul 11 12:52:55 2011 Errors in file /ora/app/admin/oraapp/bdump/oraapp2_lms0_1044892.trc: ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
[], [], [], [] Mon Jul 11 12:52:55 2011 LMS0: terminating instance due to error 484 Instance terminated by LMS0, pid = 1044892 Mon Jul 11 12:53:41 2011 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Interface type 1 en6 192.168.182.0 configured from OCR for use as a cluster interconnect Interface type 1 en4 10.66.81.0 configured from OCR for use as a public interface Picked latch-free SCN scheme 3 Autotune of undo retention is turned on. LICENSE_MAX_USERS = 0 SYS auditing is disabled |
1 2 |
从上面的信息来看,该节点在12点52分出现ora-600内部错误,仅仅不到10s的时间,就导致lms0进程异常中止, 最后导致该实例重启。 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
我们来看看对于的trace信息: *** 2011-07-11 12:52:48.437 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x<strong><span style="color: #0000ff">700000041FD2CE0</span></strong>], [0x<strong><span style="color: #0000ff">7000000E4CE1530</span></strong>], [], [], [], [] ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedst+001c bl ksedst1 FFFFFFFFFFFC560 ? 000000000 ? ksedmp+0290 bl ksedst 104811C98 ? ksfdmp+0018 bl 03F40AE0 kgerinv+00dc bl _ptrgl kgeasnmierr+0040 bl kgerinv 000000000 ? 000000000 ? 000000000 ? 11021E068 ? FFFFFFFFFFFCCA0 ? kjbrchkmode+0230 bl 03F3FD0C kjbriconsign+00d8 bl kjbrchkmode 000000000 ? FFFFFFFFFFFCCA8 ? kjbmiconsign+01f0 bl kjbriconsign 10517A4BC ? 000000000 ? 000000002 ? 42C98000E0000 ? 0FFFFCCF0 ? kjbmpocr+1114 bl kjbmiconsign 110BFB640 ? 11021E068 ? 114C3177CB0279 ? 110C9F850 ? FFFFFFFFFFFCCA0 ? kjmxmpm+0238 bl 03F3FB24 kjmpmsgi+050c bl kjmxmpm 0A25B7EE4 ? 600000000 ? A25B00ED00000000 ? 000000000 ? kjmsm+177c bl kjmpmsgi 110C6D820 ? 1EAD6A410 ? 100000000 ? 000000001 ? 0FFFFD780 ? 000000000 ? ksbrdp+03e0 bl _ptrgl opirip+03fc bl 01FAFCDC opidrv+0448 bl opirip 1102736F0 ? 410275030 ? FFFFFFFFFFFF310 ? sou2o+0090 bl opidrv 320318411C ? 400000010 ? FFFFFFFFFFFF310 ? opimai_real+0150 bl 01FAB1D4 main+0098 bl opimai_real 000000000 ? 000000000 ? __start+0098 bl main 000000000 ? 000000000 ? |
1 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
*** 2011-07-11 12:52:51.089 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], [] ----- Call Stack Trace ----- calling call entry argument values in hex location type point (? means dubious value) -------------------- -------- -------------------- ---------------------------- ksedst+001c bl ksedst1 FFFFFFFFFFFC9E0 ? 000000000 ? ksedmp+0290 bl ksedst 104811C98 ? kjmpmsgi+0620 bl 03F40AE0 kjmsm+177c bl kjmpmsgi 110C6D820 ? 1EAD6A410 ? 100000000 ? 000000001 ? 0FFFFD780 ? 000000000 ? ksbrdp+03e0 bl _ptrgl opirip+03fc bl 01FAFCDC opidrv+0448 bl opirip 1102736F0 ? 410275030 ? FFFFFFFFFFFF310 ? sou2o+0090 bl opidrv 320318411C ? 400000010 ? FFFFFFFFFFFF310 ? opimai_real+0150 bl 01FAB1D4 main+0098 bl opimai_real 000000000 ? 000000000 ? __start+0098 bl main 000000000 ? 000000000 ? |
1 2 3 |
<strong><span style="color: #ff0000">前后2次的ora--00600信息的,虽然call stack信息有点微小差别,不过基本上一致,根据600错误以及call stack</span></strong> <strong><span style="color: #ff0000">我们在metalink上查到一个bug,对比了一下call stack信息,90%一致。虽然不能肯定的下结论,但是感觉也有点</span></strong> <strong><span style="color: #ff0000">八九不离十了的味道了。</span></strong> |
1 2 |
下面咱们继续,既然是lms0进程异常中止,那么我们这里就很有必要来分析一下lms0进程的状态了, 于是在trace中搜索 process或lms0 找到如下信息: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
PROCESS STATE ------------- Process global information: process: 7000000ea3a7dd8, call: 7000000ea5c9a88, xact: 0, curses: 7000000ea578b08, usrses: 7000000ea580548 ---------------------------------------- SO: 7000000ea3a7dd8, type: 2, <strong><span style="color: #0000ff">owner: 0</span></strong>, flag: INIT/-/-/0x00 ---<strong><span style="color: #0000ff">从这里owner 为0 看出该so 已经处于顶层了</span></strong> (process) Oracle pid=7, calls cur/top: 7000000ea5c9a88/7000000ea5c7680, flag: (6) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 90 last post received-location: KJCS Post snd proxy to flush msg last process to post me: 7000000ea3a8d98 1 6 last post sent: 0 0 93 last post sent-location: kjbrcvdscn: scn acked last process posted by me: 7000000ea3a8d98 1 6 (latch info) wait_event=0 bits=10 <span style="color: #0000ff"><strong>holding (efd=5) 7000000e8f688a0 Child gcs resource hash level=4 child#=20 </strong></span> Location from where latch is held: kjbmpconvert: Context saved from call: 0 state=busy, wlstate=free Process Group: DEFAULT, pseudo proc: 7000000e7418bc0 O/S info: user: oracle, term: UNKNOWN, ospid: 1044892 OSD pid info: Unix process pid: 1044892, image: <a href="mailto:oracle@t3smis_db_b">oracle@t3smis_db_b</a> (LMS0) <strong><span style="color: #000000">这里的gcs resource hash 是latch 名称,其level为4,可以查询v$latch</span></strong>. 搜索字符串<span style="color: #ff0000">7000000ea3a7dd8</span>,发现如下内容: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 |
SO: 7000000e1963000, type: 20, owner: 7000000ea3a7dd8, flag: -/-/-/0x00 namespace [KSXP] key = [ 32 32 30 47 45 53 52 30 30 31 00 ] ---------------------------------------- SO: 7000000ea580548, type: 4, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (session) sid: 550 trans: 0, creator: 7000000ea3a7dd8, flag: (51) USR/- BSY/-/-/-/-/- DID: 0002-0007-00000003, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS last wait for 'gcs remote message' blocking sess=0x0 seq=62654 wait_time=18631 seconds since wait started=4 waittime=18, poll=0, event=0 Dumping Session Wait History for <span style="color: #0000ff">'gcs remote message'</span> count=1 wait_time=18631 waittime=18, poll=0, event=0 for <strong><span style="color: #0000ff">'gcs remote message'</span></strong> count=1 wait_time=751 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=248 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=74 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=2099 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=7055 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=7010 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=56 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=6202 waittime=18, poll=0, event=0 for 'gcs remote message' count=1 wait_time=803 waittime=18, poll=0, event=0 temporary object counter: 0 ---------------------------------------- UOL used : 0 locks(used=0, free=0) KGX Atomic Operation Log 7000000e8fa1998 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 550 efd 0 whr 0 slp 0 KGX Atomic Operation Log 7000000e8fa19e0 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 550 efd 0 whr 0 slp 0 KGX Atomic Operation Log 7000000e8fa1a28 Mutex 0(0, 0) idn 0 oper NONE Library Cache uid 550 efd 0 whr 0 slp 0 ---------------------------------------- SO: 7000000e08f57d8, type: 41, owner: 7000000ea580548, flag: INIT/-/-/0x00 (dummy) nxc=0, nlb=0 ---------------------------------------- SO: 7000000e765b078, type: 11, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 7000000ea3a7dd8, event: 7, last message event: 93, last message waited event: 93, messages read: 1 channel: (7000000e767c640) system events broadcast channel scope: 2, event: 38038, last mesage event: 93, publishers/subscribers: 0/141, messages published: 1 ---------------------------------------- SO: 7000000ea93e9e0, type: 19, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 GES MSG BUFFERS: st=emp chunk=0x0 hdr=0x0 lnk=0x0 flags=0x0 inc=16 outq=2 sndq=0 opid=7 prmb=0x0 mbg[i]=(4626201 97644168) mbg[b]=(54246 41376233) mbg[r]=(0 0) fmq[i]=(30 22) fmq[b]=(20 8) fmq[r]=(0 0) mop[s]=64557405 mop[q]=104801381 pendq=0 zmbq=0 ------------process 0x7000000ea93e9e0-------------------- proc version : 0 Local node : 1 pid : 1044892 lkp_node : 1 svr_mode : 0 proc state : KJP_NORMAL Last drm hb acked : 0 Total accesses : 351576 Imm. accesses : 351543 Locks on ASTQ : 0 Locks Pending AST : 0 Granted locks : 0 AST_Q: PENDING_Q: GRANTED_Q: KJM HIST LMS0: 1:1 12:1:10 7:1 6:1 10:0 2:0 19:0 18:2 17:2 16:2 15:2 12:18639 7:0 6:15 10:0 17:1 16:1 14:73:4 1:0 12:1:7 7:0 6:0 10:1 17:1 16:1 12:753 7:1 6:0 10:0 17:2 16:1 13:65521:4 20:84:3 1:0 12:1:7 7:1 6:0 10:0 17:2 16:1 12:250 7:1 6:16 10:0 17:2 16:1 14:73:5 1:1 12:1:9 7:0 6:0 10:1 17:1 16:1 12:78 7:0 6:0 10:1 17:2 21:32 16:2 12:2105 7:0 6:14 DEFER MSG QUEUE ON LMS0 IS EMPTY SEQUENCES: 0:116284752.0 1:0.0 ---------------------------------------- SO: 7000000ea5c7680, type: 3, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (call) sess: cur 7000000ea580548, <strong><span style="color: #0000ff">rec 7000000ea578b08</span></strong>, usr 7000000ea580548; depth: 0 ---------------------------------------- SO: 7000000ea5c9a88, type: 3, owner: 7000000ea5c7680, flag: INIT/-/-/0x00 (call) sess: cur 7000000ea578b08, <strong><span style="color: #0000ff">rec 7000000ea578b08</span></strong>, usr 7000000ea580548; <strong><span style="color: #0000ff">depth: 1</span></strong> ---------------------------------------- SO: 7000000ea578b08, type: 4, owner: 7000000ea5c7680, flag: INIT/-/-/0x00 (session) sid: 538 trans: 0, creator: 0, flag: (2) -/REC -/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS temporary object counter: 0 ---------------------------------------- SO: 7000000e7842e80, type: 16, owner: 7000000ea3a7dd8, flag: INIT/-/-/0x00 (osp req holder) <span style="color: #0000ff"><strong> 从上面的rec 7000000ea578b08 我们可以判断为这个是一个递归操作,而后面的depth 1则是表示</strong></span> <span style="color: #0000ff"><strong> 递归操作的层次。</strong></span> 无奈之下,我猜测那个600错误中的字符串700000041fd2ce0 应该是个类似latch地址的东西,或者说是 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 |
一个资源的标示,<strong><span style="color: #0000ff">于是搜索字符串700000041fd2ce0,发现如下信息:</span></strong> *** 2011-07-11 12:52:52.561 GLOBAL CACHE ELEMENT DUMP (address: 700000041fd2c70): id1: 0x42c98 id2: 0xe0000 obj: 0 block: (14/273560) lock: NL rls: 0x0000 acq: 0x0000 latch: 3 flags: 0x41 fair: 0 recovery: 0 bscn: 0x0.0 bctx: 0 write: 0 scan: 0x2000000 xflg: 0 xid: 0x0.0.0 lcp: 0 lnk: [NULL] lch: [700000053f3aa30,700000053f3aa30] LIST OF BUFFERS LINKED TO THIS GLOBAL CACHE ELEMENT: flg: 0x08000000 state: SCURRENT mode: NULL addr: 700000053f3a930 obj: 47733 cls: DATA bscn: 0x2.cb164f84 GCS UNKNOWN <strong><span style="color: #0000ff">700000041fd2ce0</span></strong>,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d1a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 Dump of buffer cache at level 1 for tsn=9, rdba=58993816 BH (700000053f3a930) file#: 14 rdba: 0x03842c98 (14/273560) class: 1 ba: 700000053602000 set: 10 blksize: 8192 bsi: 0 set-flg: 0 pwbcnt: 0 dbwrid: 0 obj: 47733 objn: 47733 tsn: 9 afn: 14 hash: [7000000e1d2bb78,7000000e1d2bb78] lru: [700000014f61a38,70000003bf83688] lru-flags: hot_buffer ckptq: [NULL] fileq: [NULL] objq: [700000020f63830,700000052f14028] st: SCURRENT md: NULL tch: 0 le: 700000041fd2c70 <span style="color: #0000ff">flags: remote_transfer</span> LRBA: [0x0.0.0] HSCN: [0xffff.ffffffff] HSUB: [65535] GCS RESOURCE <strong><span style="color: #0000ff">7000000e4ce1530 </span></strong>hashq [7000000d01f0ff0,7000000e4493420] name[0x42c98.e0000] pkey 47733 grant 7000000e2190070 cvt 0 send 0,0 write 0,0@65535 flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0 xid 0x0000.000.00000000 sid 0 pkey 47733 hv 89 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 12, dom 0] kjga st 0x4, step 0.0.0, cinc 16, rmno 1550, flags 0x0 lb 0, hb 0, myb 3475, drmb 3475, apifrz 0 GCS SHADOW 7000000e2190070,1272 <strong><span style="color: #0000ff">sq[700000041fd2ce0,7000000e4ce1550] resp[7000000e4ce1530,0x42c98.e0000] </span></strong>pkey 47733 grant 1 cvt 0 mdrole 0x21 st 0x40 <span style="color: #ff0000">GRANTQ rl LOCAL</span> master 1 owner 0 sid 0 remote[700000045fb5940,1518] hist 0x8208225f history 0x1f.0x9.0x2.0x2.0x2.0x2. cflag 0x0 sender 0 flags 0x0 replay# 0 disk: 0x0000.00000000 write request: 0x0000.00000000 pi scn: 0x0000.00000000 msgseq 0x3 updseq 0x0 reqids[1996,0,0] infop 0x0 GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d1a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dd6e9e10,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 147 owner 0 sid 0 remote[70206dc00,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 24 flags 0xc replay# 0 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045daa8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dd6e9318,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 148 owner 0 sid 0 remote[70206dbff,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 200 flags 0xc replay# 0 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d9a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dd6e94c8,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 118 owner 0 sid 0 remote[70206dbf6,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 8 flags 0xc replay# 0 disk: 0xdda6.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d0a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dda67608,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 158 owner 0 sid 0 remote[70206dbf5,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 96 flags 0xc replay# 0 disk: 0xdd4b.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045cfa8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 ABORTING <strong><span style="color: #ff0000">lms abort after exception 600</span></strong> KJC Communication Dump: state 0x5 flags 0x0 mode 0x0 inst 1 inc 16 nrcv 3 nsp 3 nrcvbuf 1000 reg_msg: sz 456 cur 229 (s:0 i:229) max 967 ini 1750 big_msg: sz 8240 cur 21 (s:0 i:21) max 529 ini 1750 rsv_msg: sz 8240 cur 0 (s:0 i:0) max 0 tot 1000 rcvr: id 0 orapid 6 ospid 1098058 rcvr: id 2 orapid 8 ospid 463194 rcvr: id 1 orapid 7 ospid 1044892 send proxy: id 0 ndst 1 (0:0 ) send proxy: id 2 ndst 1 (0:2 ) send proxy: id 1 ndst 1 (0:1 ) GES resource limits: ges resources: cur 0 max 0 ini 11712 ges enqueues: cur 0 max 0 ini 17740 ges cresources: cur 977 max 2335 gcs resources: cur 91982 max 185193 ini 185193 gcs shadows: cur 72232 max 135376 ini 185193 KJCTS state: seq-check:no timeout:yes waitticks:0x3 highload no GES destination context: Dest 0 rcvr 0 inc 16 state 0x10041 tstate 0x0 |
1 2 3 |
这里我发现GCS RESOURCE 7000000e4ce1530 这里的7000000e4ce1530 正是600错误中的错误字符串。 我们回头再来认真的看看上面的600错误,如下: ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], [] |
1 |
其中字符串代码,700000041FD2CE0 7000000E4CE1530 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 |
<strong><span style="color: #ff0000">*** 2011-07-11 06:46:53.780</span></strong> kjdrvalidRMno: msg type 34 from node 0 dropped FUSION MSG 11094cc68,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8] CLOSE [0x1794.80000, 47523] shadow [0,0] seq 0x2 act 1 client [700000026fd0078,1025] reqid 1369 ordered 0 grant 1 convert 0 role 0 pi [0x0.0x0] flags 0x0 state 0x20 disk scn 0x0.0 writereq scn 0x0.0 rreqid 0 msgRM# 1536 bkt# 3826 drmbkt# 3826 pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0 hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0] kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20 lb 3072, hb 4095, myb 3826, drmb 3826, apifrz 1 kjmvalidate: drm drop a message RMno 1536 from 0 type 34 mver 16 myver 16 seq 0.109921737 rseq 0.109921736 flag x6d kjdrvalidRMno: msg type 34 from node 0 dropped FUSION MSG 11094cce8,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8] CLOSE [0x178c.80000, 47523] shadow [0,0] seq 0x2 act 1 client [70000005bfc6010,1475] reqid 1921 ordered 0 grant 1 convert 0 role 0 pi [0x0.0x0] flags 0x0 state 0x20 disk scn 0x0.0 writereq scn 0x0.0 rreqid 0 msgRM# 1536 bkt# 3825 drmbkt# 3825 pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0 hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0] kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20 lb 3072, hb 4095, myb 3825, drmb 3825, apifrz 1 kjmvalidate: drm drop a message RMno 1536 from 0 type 34 mver 16 myver 16 seq 0.109921738 rseq 0.109921737 flag x6d kjdrvalidRMno: msg type 34 from node 0 dropped FUSION MSG 11094cd68,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8] CLOSE [0x17cb.80000, 47523] shadow [0,0] seq 0x2 act 1 client [70000001fff22f0,1158] reqid 1407 ordered 0 grant 1 convert 0 role 0 pi [0x0.0x0] flags 0x0 state 0x20 disk scn 0x0.0 writereq scn 0x0.0 rreqid 0 msgRM# 1536 bkt# 3833 drmbkt# 3833 pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0 hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0] kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20 lb 3072, hb 4095, myb 3833, drmb 3833, apifrz 1 kjmvalidate: drm drop a message RMno 1536 from 0 type 34 mver 16 myver 16 seq 0.109921739 rseq 0.109921738 flag x6d kjdrvalidRMno: msg type 34 from node 0 dropped FUSION MSG 11094cde8,34 from[0,1073492] ver[16,1536] ln 128 sq[0,8] CLOSE [0x17c3.80000, 47523] shadow [0,0] seq 0x2 act 1 client [70000005bff7d90,1289] reqid 1807 ordered 0 grant 1 convert 0 role 0 pi [0x0.0x0] flags 0x0 state 0x20 disk scn 0x0.0 writereq scn 0x0.0 rreqid 0 msgRM# 1536 bkt# 3832 drmbkt# 3832 pkey 47523, stat 5, masters[32767, 32767->1], reminc 0, RM# 0 flg 0x0 hv 47 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 14, dom 0] kjga st 0x4, step 0.34.0, cinc 16, rmno 1536, flags 0x20 lb 3072, hb 4095, myb 3832, drmb 3832, apifrz 1 kjmvalidate: drm drop a message RMno 1536 from 0 type 34 mver 16 myver 16 seq 0.109921740 rseq 0.109921739 flag x6d <strong><span style="color: #ff0000">*** 2011-07-11 12:52:48.432</span></strong> GCS UNKNOWN <strong><span style="color: #ff0000">700000041fd2ce0</span></strong>,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d1a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS RESOURCE <span style="color: #ff0000"><strong>7000000e4ce1530 </strong></span>hashq [7000000d01f0ff0,7000000e4493420] name[0x42c98.e0000] pkey 47733 grant 7000000e2190070 cvt 0 send 0,0 write 0,0@65535 flag 0x0 mdrole 0x1 mode 1 scan 0 role LOCAL disk: 0x0000.00000000 write: 0x0000.00000000 cnt 0x0 hist 0x0 xid 0x0000.000.00000000 sid 0 pkey 47733 hv 89 [stat 0x0, 1->1, wm 32767, RMno 0, reminc 12, dom 0] kjga st 0x4, step 0.0.0, cinc 16, rmno 1550, flags 0x0 lb 0, hb 0, myb 3475, drmb 3475, apifrz 0 GCS SHADOW 7000000e2190070,1272 sq[700000041fd2ce0,7000000e4ce1550] resp[7000000e4ce1530,0x42c98.e0000] pkey 47733 grant 1 cvt 0 mdrole 0x21 st 0x40 GRANTQ rl LOCAL master 1 owner 0 sid 0 remote[700000045fb5940,1518] hist 0x8208225f history 0x1f.0x9.0x2.0x2.0x2.0x2. cflag 0x0 sender 0 flags 0x0 replay# 0 disk: 0x0000.00000000 write request: 0x0000.00000000 pi scn: 0x0000.00000000 msgseq 0x3 updseq 0x0 reqids[1996,0,0] infop 0x0 GCS UNKNOWN 700000041fd2ce0,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 158 owner 0 sid 0 remote[70206dbf7,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 16 flags 0xc replay# -501677968 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d1a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dd6e9e10,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 147 owner 0 sid 0 remote[70206dc00,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 24 flags 0xc replay# 0 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045daa8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dd6e9318,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 148 owner 0 sid 0 remote[70206dbff,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 200 flags 0xc replay# 0 disk: 0xdd6e.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d9a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dd6e94c8,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 118 owner 0 sid 0 remote[70206dbf6,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 8 flags 0xc replay# 0 disk: 0xdda6.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045d0a8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 GCS UNKNOWN 7000000dda67608,0 sq[0,0] resp[0,0x0.0] pkey 0 grant 0 cvt 0 mdrole 0x0 st 0x0 GRANTQ rl LOCAL master 158 owner 0 sid 0 remote[70206dbf5,0] hist 0x6006c history 0x2c.0x1.0x20.0x1.0x0.0x0. cflag 0x0 sender 96 flags 0xc replay# 0 disk: 0xdd4b.07000000 write request: 0x05dc.00000089 pi scn: 0x0002.0045cfa8 msgseq 0x8 updseq 0x0 reqids[8,259,258] infop 0x0 ABORTING *** 2011-07-11 12:52:48.437 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kjbrchkmode], [24], [0x700000041FD2CE0], [0x7000000E4CE1530], [], [], [], [] |
1 |
从上看,我感觉似乎跟DRM有点关系,至于是不是一定有关系,我不敢肯定,隐隐的感觉有点关联。 |
1 |
<span style="color: #0000ff"><strong>到最后,我们有必要来回顾一下lms进程的作用,说的简单一点,该进程的作用就是管理集群节点之间数据block的访问,并在节点之间传递cache buffer</strong></span> |
1 |
<span style="color: #0000ff"><strong>的块镜像,,当lms进程跨节点请求数据时,为了确保数据的一致性,在同一时刻,保证只能有一个实例去更新一个数据块,该进程跟lmd进程协同作用,</strong></span> |
1 |
<span style="color: #0000ff"><strong>所以通常来说,lms出问题,lmd也会出问题。</strong></span> |
1 |
<span style="color: #0000ff"><strong>我们再回到这个问题上来,lms进程为啥为异常abort?从上面gcs信息来看,是由于请求的资源和所获取到的资源不一致导致,而又由于需要</strong></span> |
1 |
<span style="color: #0000ff"><strong>保证数据的一致性,进而导致lms abort。 正常情况下,7000000E4CE1530应该跟700000041FD2CE0一样才对。</strong></span> |
1 |
当然上面也是我的个人猜测,如果大家有什么高见,欢迎讨论。 |
1 2 |
最后我们给客户的答复是上面判断的那个bug,虽然有有点无奈,不过也没办法。关于该bug的详细描述如下: BOTH RAC INSTANCES CRASHED WITH ORA-600 [kjbrchkmode:last] [ID 987822.1] |
1 2 3 |
-------------------------------------------------------------------------------- 修改时间 27-AUG-2010 类型 PROBLEM 状态 PUBLISHED |
1 2 3 4 |
In this Document Symptoms Cause Solution |
1 |
-------------------------------------------------------------------------------- |
1 2 |
Symptoms RAC Instance crashed after the LMS process terminated with ORA-600 [kjbrchkmode:last], [60], [0x5F91E9390], [2], [], [], [], [] |
1 |
The corresponding LMS tracefile showed the following call stack: |
1 2 3 4 |
ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kjbrchkmode:last], [60], [0x5F91E9390], [2], [], [], [], [] ----- Call Stack Trace ----- ksedmp kgerinv kgeasnmierr kjbrchkmode kjbmprlst kjmxmpm kjmpmsgi kjmsm ksbrdp opirip opidrv sou2o opimai_real main |
1 2 3 |
Cause The issue is caused by: BUG 6239052 INSTANCE CRASH DUE TO ORA-600 [KJBRCHKMODE LAST] |
1 |
The fix for bug 6239052 is included in patchset 10.2.0.4, and 11g onwards. |
1 |
Leave a Reply
You must be logged in to post a comment.