Oracle 11.2.0.3.9 RAC is hang (for Linux)
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
一个朋友的库出现异常,6月30号就出现hang的问题,当时也给了建议直接kill会话解决了,没想到现在又出现了。简单的帮忙分析了一下原因,如下是alert log的信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
Mon Jul 14 18:50:06 2014 Archived Log entry 3689 added for thread 1 sequence 2223 ID 0x1db958b2 dest 1: Mon Jul 14 19:40:29 2014 Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmhb_17008.trc (incident=224129): ORA-00445: background process "GCR0" did not start after 120 seconds Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_224129/xxxx1_lmhb_17008_i224129.trc Mon Jul 14 19:40:30 2014 Dumping diagnostic data in directory=[cdmp_20140714194030], requested by (instance=1, osid=17008 (LMHB)), summary=[incident=224129]. Mon Jul 14 19:40:46 2014 LCK0 (ospid: 17064) waits for latch 'shared pool' for 169 secs. Mon Jul 14 19:41:29 2014 PMON failed to acquire latch, see PMON dump Mon Jul 14 19:42:01 2014 Errors in file /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/trace/xxxx1_lmhb_17008.trc (incident=224130): ORA-29771: process USER (OSID 4221) blocks LCK0 (OSID 17064) for more than 70 seconds Incident details in: /oracle/app/oracle/diag/rdbms/xxxx/xxxx1/incident/incdir_224130/xxxx1_lmhb_17008_i224130.trc Mon Jul 14 19:42:29 2014 PMON failed to acquire latch, see PMON dump Mon Jul 14 19:43:10 2014 USER (ospid: 4221) is blocking LCK0 (ospid: 17064) in a wait LMHB (ospid: 17008) kills USER (ospid: 4221). |
从上面的alert log信息来看,可以看到GCR0进程启动失败,且核心进程LCK0也在等待latch:shared pool. 更重要的一点是:
连pmon进程也无法获得latch了。很明显,这个数据库hang住了。
对于Oracle hang问题,大家所了解做法是,先登录数据库,然后找到阻塞源头kill会话。
那么如果数据库hang住了之后,怎么登录呢?对于Oracle 10g以及之后的版本可以通过sqlplus -prelim / as sysdba方式登录.
然后进行hanganalyze dump,可惜是朋友这里操作了之后看到任何信息,那么只能进行systemstate dump和processstate dump了。
首先我们来看下PMON进程在等待为什么,为什么无法获得latch ?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
PROCESS 2: PMON ---------------------------------------- SO: 0x55ccf1b98, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x55ccf1b98, name=process, file=ksu.h LINE:12616, pg=0 (process) Oracle pid:2, ser:1, calls cur/top: 0x55d7c76b0/0x55d7c76b0 flags : (0xe) SYSTEM flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 146 last post received-location: kjm.h LINE:1248 ID:kjfmHealth_Latch_killHolder: post pmon after latch holder kill last process to post me: 544d25368 1 2 last post sent: 0 0 52 last post sent-location: ksv2.h LINE:1682 ID:ksvpst: cldel - post master last process posted by me: 544d25368 1 2 (latch info) wait_event=0 bits=0 Location from where call was made: kgh.h LINE:6483 ID:kghfre: Chunk Header Context saved from call: 22949992976 waiting for 601091f8 Child shared pool level=7 child#=2 Location from where latch is held: kgh.h LINE:6463 ID:kghalo: Context saved from call: 0 state=busy [holder orapid=146] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 205 (1830, 1405339770, 1769) 30 (1829, 1405339770, 1650) 208 (1793, 1405339770, 328) 66 (1770, 1405339770, 1770) 104 (1743, 1405339770, 1743) 16 (1536, 1405339770, 1536) 96 (1489, 1405339770, 1489) 58 (270, 1405339770, 270) waiter count=8 gotten 2181033884 times wait, failed first 7066852 sleeps 865327 gotten 215950 times nowait, failed: 464 possible holder pid = 146 ospid=13135 Process Group: DEFAULT, pseudo proc: 0x540f7d060 O/S info: user: oracle, term: UNKNOWN, ospid: 16945 OSD pid info: Unix process pid: 16945, image: oracle@itvxxxx1 (PMON) |
可以看到pmon进程无法获得shared pool latch,而该资源(601091f8)正在被orapid=146所持有。下面来看下orapid 146在干什么?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
PROCESS 146: ---------------------------------------- SO: 0x55cd17458, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x55cd17458, name=process, file=ksu.h LINE:12616, pg=0 (process) Oracle pid:146, ser:39, calls cur/top: 0x26c90ee80/0x26c9099e0 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 259 last post received-location: kgl.h LINE:8714 ID:kgllkdl: post after freeing latch last process to post me: 548d29e00 4 0 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:285 ID:ksasnd last process posted by me: 55ccf6f08 1 6 (latch info) wait_event=0 bits=80 holding (efd=17) 601091f8 Child shared pool level=7 child#=2 Location from where latch is held: kgh.h LINE:6463 ID:kghalo: Context saved from call: 0 state=busy [holder orapid=146] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 205 (2115, 1405340055, 2054) 30 (2114, 1405340055, 1935) 208 (2078, 1405340055, 162) 66 (2055, 1405340055, 2055) 104 (2028, 1405340055, 2028) 16 (1821, 1405340055, 1821) 96 (1774, 1405340055, 1774) 58 (555, 1405340055, 555) 65 (255, 1405340055, 255) waiter count=9 Process Group: DEFAULT, pseudo proc: 0x540f7d060 O/S info: user: grid, term: UNKNOWN, ospid: 13135 OSD pid info: Unix process pid: 13135, image: oracle@itvxxxx1 |
ok,我们看到该进程正在持有 601091f8这个Child shared pool。朋友这里单独对orapid=146这个会话进行了dump,下面来深入分析一下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 |
SO: 0x55cd17458, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x55cd17458, name=process, file=ksu.h LINE:12616, pg=0 (process) Oracle pid:146, ser:39, calls cur/top: 0x26c90ee80/0x26c9099e0 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 0 0 259 last post received-location: kgl.h LINE:8714 ID:kgllkdl: post after freeing latch last process to post me: 548d29e00 4 0 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:285 ID:ksasnd last process posted by me: 55ccf6f08 1 6 (latch info) wait_event=0 bits=80 holding (efd=17) 601091f8 Child shared pool level=7 child#=2 Location from where latch is held: kgh.h LINE:6463 ID:kghalo: Context saved from call: 0 state=busy [holder orapid=146] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 205 (3715, 1405341655, 3654) 30 (3714, 1405341655, 3535) 208 (3678, 1405341655, 406) 66 (3655, 1405341655, 3655) 104 (3628, 1405341655, 3628) 16 (3421, 1405341655, 3421) 96 (3374, 1405341655, 3374) 58 (2155, 1405341655, 2155) 65 (1855, 1405341655, 1855) 64 (956, 1405341655, 956) 69 (551, 1405341655, 551) waiter count=11 Process Group: DEFAULT, pseudo proc: 0x540f7d060 O/S info: user: grid, term: UNKNOWN, ospid: 13135 OSD pid info: Unix process pid: 13135, image: oracle@itvxxxx1 。。。。。。。。。 。。。。。。。。。 SO: 0x55d09c5e0, type: 4, owner: 0x55cd17458, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x55cd17458, name=session, file=ksu.h LINE:12624, pg=0 (session) sid: 385 ser: 3787 trans: (nil), creator: 0x55cd17458 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) oct: 3, prv: 0, sql: 0x54f4c5250, psql: 0x4bc1de318, user: 34/ITVMANAGER ksuxds FALSE at location: 0 service name: xxxx client details: O/S info: user: root, term: unknown, ospid: 1234 machine: localhost.localdomain program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: 0: waiting for 'library cache: mutex X' idn=0x1a43f94e, value=0x76b00000000, where=0x4f wait_id=7745 seq_num=7746 snap_id=1 wait times: snap=1 min 38 sec, exc=1 min 38 sec, total=1 min 38 sec wait times: max=infinite, heur=61 min 56 sec wait counts: calls=8957 os=8957 in_wait=1 iflags=0x15b2 There is at least one session blocking this session. Dumping 1 direct blocker(s): inst: 1, sid: 1899, ser: 35707 Dumping final blocker: inst: 1, sid: 2654, ser: 39173 |
可以看到该进程正在等待library cache: mutex X,而从最后的几行信息可以看到提示,fina blocker为:
inst: 1, sid: 2654, ser: 39173 (注意:11gR2之前这个信息是没有的,可见11gR2比较强大了)
既然这里提到最终的blocker 会话是2654,那么我们搜索关键字:sid: 2654 找到如下信息:
1 2 3 4 5 6 7 8 |
SO: 0x5417cacb0, type: 4, owner: 0x55cd163a8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x55cd163a8, name=session, file=ksu.h LINE:12624, pg=0 (session) sid: 2654 ser: 39173 trans: (nil), creator: 0x55cd163a8 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x9) -/-/INC DID: , short-term DID: txn branch: (nil) oct: 3, prv: 0, sql: 0x34f05aa78, psql: 0x29ac2e878, user: 34/xxxxxx |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
ksuxds FALSE at location: 0 service name: xxxx client details: O/S info: user: root, term: unknown, ospid: 1234 machine: localhost.localdomain program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: 0: waiting for 'latch: shared pool' address=0x60109298, number=0x133, tries=0x0 wait_id=1048 seq_num=1049 snap_id=1 wait times: snap=42 min 12 sec, exc=42 min 12 sec, total=42 min 12 sec wait times: max=infinite, heur=42 min 12 sec wait counts: calls=0 os=0 in_wait=1 iflags=0x2520 There is at least one session blocking this session. Dumping 1 direct blocker(s): inst: 1, sid: 1716, ser: 19027 Dumping final blocker: inst: 1, sid: 1716, ser: 19027 There are 57 sessions blocked by this session. |
这里又提示该进程的最终blocker为 sid 1716。而2654 会话等待等待latch: shared pool。 下面继续搜索sid: 1716
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
SO: 0x541505e80, type: 4, owner: 0x540d8abf0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x540d8abf0, name=session, file=ksu.h LINE:12624, pg=0 (session) sid: 1716 ser: 19027 trans: (nil), creator: 0x540d8abf0 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) oct: 3, prv: 0, sql: 0x55ba986f0, psql: 0x34ea4cf40, user: 34/xxxxxxxx ksuxds FALSE at location: 0 service name: xxxx client details: O/S info: user: root, term: unknown, ospid: 1234 machine: localhost.localdomain program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: 0: waiting for 'library cache: mutex X' idn=0x1a43f94e, value=0x76b00000000, where=0x4f wait_id=32063 seq_num=32067 snap_id=1 wait times: snap=5 min 22 sec, exc=5 min 22 sec, total=5 min 22 sec wait times: max=infinite, heur=41 min 33 sec wait counts: calls=29210 os=29210 in_wait=1 iflags=0x15b2 There is at least one session blocking this session. Dumping 1 direct blocker(s): inst: 1, sid: 1899, ser: 35707 Dumping final blocker: inst: 1, sid: 1899, ser: 35707 There are 56 sessions blocked by this session. Dumping one waiter: inst: 1, sid: 2654, ser: 39173 wait event: 'latch: shared pool' p1: 'address'=0x60109298 p2: 'number'=0x133 p3: 'tries'=0x0 row_wait_obj#: 19692, block#: 3852, row#: 0, file# 50 min_blocked_time: 2483 secs, waiter_cache_ver: 11844 Wait State: fixed_waits=0 flags=0x22 boundary=(nil)/-1 |
可以看到sid 1716 在等待library cache: mutex X,而该会话的最终blocker为 sid:1899,继续搜索sid:1899
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
SO: 0x54553f600, type: 4, owner: 0x55cd152f8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x55cd152f8, name=session, file=ksu.h LINE:12624, pg=0 (session) sid: 1899 ser: 35707 trans: (nil), creator: 0x55cd152f8 flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) oct: 3, prv: 0, sql: 0x555f6e3b0, psql: 0x55a0664c0, user: 34/xxxxxxx ksuxds FALSE at location: 0 service name: xxxx client details: O/S info: user: root, term: unknown, ospid: 1234 machine: localhost.localdomain program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: 0: waiting for 'library cache: mutex X' idn=0x1d694212, value=0xa5e00000000, where=0x7f wait_id=928 seq_num=929 snap_id=1 wait times: snap=6 min 0 sec, exc=6 min 0 sec, total=6 min 0 sec wait times: max=infinite, heur=42 min 12 sec wait counts: calls=32681 os=32681 in_wait=1 iflags=0x15b2 There is at least one session blocking this session. Dumping 1 direct blocker(s): inst: 1, sid: 2654, ser: 39173 Dumping final blocker: inst: 1, sid: 2654, ser: 39173 There are 57 sessions blocked by this session. |
我们可以清楚的看到,最后又回到了sid:2654. 感觉饶了一圈又回来了。 如果你感觉这样分析有点费劲,那么可以通过
Oracle systemstate dump的格式化脚本将前面的systemstate dump的trace格式化,格式化之后,可以你会发现如下的信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
Starting Systemstate 1 .............................................................................. .................. ............................................................. ............................................................................... ............................................................................... ... Ass.Awk Version 1.0.39 ~~~~~~~~~~~~~~~~~~~~~~ Source file : /soft/xxxx1_ora_23533.trc System State 1 (2014-07-14 20:09:30.961) ~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~ 1: [DEAD] 2: waiting for 'latch: shared pool' [Latch 601091f8] 3: waiting for 'rdbms ipc message' 4: waiting for 'VKTM Logical Idle Wait' 。。。。。。。 。。。。。。。 Resource Holder State Latch 601091f8 146: 146: is waiting for 138: Latch 60109298 105: 105: is waiting for 138: PIN: Handle=0x55b1664c8 ??? Blocker Latch 60109158 97: 97: is waiting for 138: Enq PV-00000000-00000000 16: 16: is waiting for 146: Latch 53330ec30 54: 54: is waiting for Rcache object=0x491dfa5e0 Rcache object=0x491dfa5e0 ??? Blocker Mutex 4f3f1fc3 97: 97: is waiting for 138: Mutex 1a43f94e 138: 138: is waiting for 142: LOCK: Handle=0x54f06d310 138: 138: is waiting for 142: Mutex 1d694212 142: 142: is waiting for 105: Mutex b182f101 103: 103: is waiting for 105: Latch 6000a018 208: 208: is waiting for 146: Latch 60010f38 226: 226: is waiting for 97: |
我们把上面的信息简单整理一下,如下:
146(library cache: mutex X) 等待 138 (library cache: mutex X)
138(library cache: mutex X) 等待 142 (latch: shared pool)
142(latch: shared pool) 等待 105 (library cache: mutex X)
105(library cache: mutex X) 等待 138 (library cache: mutex X)
我们可以看出,很明显Oracle在这里形成了死锁,即library cache: mutex X 和 latch:shared pool构成了死锁.
当然这里如果是要解决问题,那么简单,将138,142,105 3个会话kill掉即可。
凭直觉我们可能就知道,这样的问题很可能是Oracle bug导致,注意朋友这里的环境已经是11.2.0.3.9了。
搜索了MOS,我认为90%的情况是这个Bug 17588480 – library cache mutex/shared pool latch deadlock (文档 ID 17588480.8)
可惜的是朋友这里没有做errorstack,无法准确定位到是不是这个bug,较为遗憾。
Leave a Reply
You must be logged in to post a comment.