database crash with ora-00494
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: database crash with ora-00494
database crash with ora-00494
1 |
群论坛一个网友的问题,原帖地址: |
http://databasefans.com/read.php?tid=273
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
###### alert 如下 ###### Thu Mar 15 22:59:38 2012 Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc: ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594' Thu Mar 15 22:59:40 2012 System State dumped to trace file /app/oracle/admin/cn3fsa/bdump/cn3fsa_arc1_299280.trc Thu Mar 15 22:59:41 2012 Thread 1 cannot allocate new log, sequence 167905 Checkpoint not complete Current log# 5 seq# 167904 mem# 0: /data02/oradata/cn3fsa/redo5a.rdo Current log# 5 seq# 167904 mem# 1: /data03/oradata/cn3fsa/redo5b.rdo Thu Mar 15 22:59:47 2012 Thread 1 advanced to log sequence 167905 (LGWR switch) Current log# 7 seq# 167905 mem# 0: /data04/oradata/cn3fsa/redo7a.rdo Current log# 7 seq# 167905 mem# 1: /data01/oradata/cn3fsa/redo7b.rdo Thu Mar 15 22:59:48 2012 Killing enqueue blocker (pid=524594) on resource CF-00000000-00000000 by killing session 550.1 Thu Mar 15 22:59:50 2012 Errors in file /app/oracle/admin/cn3fsa/bdump/cn3fsa_ckpt_524594.trc: ORA-00028: your session has been killed Thu Mar 15 22:59:50 2012 CKPT: terminating instance due to error 28 Instance terminated by CKPT, pid = 524594 Fri Mar 16 08:50:50 2012 Starting ORACLE instance (normal) sskgpgetexecname failed to get name LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Picked latch-free SCN scheme 3 Autotune of undo retention is turned off. IMODE=BR ILAT =61 LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.4.0. System parameters with non-default values: processes = 500 sessions = 555 |
1 2 3 4 5 6 7 |
从上面错误,我们可以判断是arch1无法获得CF锁,而阻塞者 524594 很可能是后台核心进行,以至于 被kill后导致实例crash了。 这里需要说明一下的是该机制是这样的: 当进行需要获取CF enqueue时,最大等待时间为900s,超过900s时,将强制kill阻塞进程。通过如下隐含 参数控制这个timeout. |
1 2 3 4 5 6 7 8 |
SQL> show parameter controlfile_enq NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ _controlfile_enqueue_dump boolean FALSE _controlfile_enqueue_holding_time integer 120 _controlfile_enqueue_timeout integer 900 _kill_controlfile_enqueue_blocker boolean TRUE |
1 2 3 |
下面我们来看看进程 524594 是什么进程? 通过trace我们可以发现,如下: |
1 |
Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT) |
1 2 3 4 5 |
我们可以发现,原来是ckpt进程被干掉了,导致数据库crash了,那么我们再进一步分析, 是什么进程kill ckpt进程了呢?从上面alert log来看,error是写到arch1 trace里面的, 所以这里我想应该是跟lgwr进程有关系。 从网友提供的trace来看,其中lgwr trace都是如下信息: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
*** 2012-03-15 22:53:53.488 Warning: log write time 600ms, size 7893KB *** 2012-03-15 22:54:00.669 Warning: log write time 540ms, size 7362KB *** 2012-03-15 22:54:18.592 Warning: log write time 640ms, size 4040KB *** 2012-03-15 22:54:19.182 Warning: log write time 590ms, size 7412KB *** 2012-03-15 22:54:30.110 Warning: log write time 910ms, size 538KB *** 2012-03-15 22:54:36.899 Warning: log write time 550ms, size 7595KB *** 2012-03-15 22:54:48.679 Warning: log write time 720ms, size 10229KB *** 2012-03-15 22:54:55.736 Warning: log write time 510ms, size 7491KB *** 2012-03-15 22:55:07.782 Warning: log write time 830ms, size 4178KB |
1 2 3 |
说明lgwr在不停写,且在等待。 从另外一个trace里面也可以看到如下信息: |
1 2 3 4 5 6 7 8 9 10 11 |
*** 2012-03-15 22:43:48.114 Waited for detached process: CKPT for 310 seconds: *** 2012-03-15 22:43:48.114 Dumping diagnostic information for CKPT: OS pid = 524594 *** 2012-03-15 22:44:00.687 Waited for detached process: CKPT for 320 seconds: *** 2012-03-15 22:44:00.687 Dumping diagnostic information for CKPT: OS pid = 524594 |
1 2 |
我们知道检查点也是在不停的工作的,如果logfile切换频繁很可能会导致这个情况。 那么最后我们再来看看ckpt进程的状态信息: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 |
*** 2012-03-15 22:59:38.881 Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6) Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4) DUMP LOCAL BLOCKER: initiate state dump for KILL BLOCKER possible owner[7.524594] on resource CF-00000000-00000000 Dumping process info of pid[7.524594] requested by pid[17.299280] ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 524594' ------------------------------------------------------------------------------- ENQUEUE [CF] HELD FOR TOO LONG enqueue holder: 'inst 1, osid 524594' Process 'inst 1, osid 524594' is holding an enqueue for maximum allowed time. The process will be terminated. Oracle Support Services triaging information: to find the root-cause, look at the call stack of process 'inst 1, osid 524594' located below. Ask the developer that owns the first NON-service layer in the stack to investigate. Common service layers are enqueues (ksq), latches (ksl), library cache pins and locks (kgl), and row cache locks (kqr). Dumping process 7.524594 info: *** 2012-03-15 22:59:38.968 Dumping diagnostic information for ospid 524594: OS pid = 524594 loadavg : 1.38 1.97 2.29 swap info: free_mem = 34.10M rsv = 64.00M alloc = 7237.58M avail = 16384.00M swap_free = 9146.42M F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD 240001 T oracle 524594 1 0 60 20 c1d0d590 108076 Mar 12 - 1:35 ora_ckpt_cn3fsa open: Permission denied 524594: ora_ckpt_cn3fsa 0x000000010025d950 ksfdaio1(??, ??, ??) + 0x330 0x000000010022c988 kcflbi(??, ??, ??, ??, ??, ??, ??, ??) + 0x6a8 0x00000001010f0970 kcbldio(??, ??, ??) + 0x1eb0 0x00000001010f6468 kcblsltio(??, ??, ??, ??) + 0x548 0x00000001010e3b50 krhahra(??) + 0x310 0x00000001010e711c krhahini(??, ??, ??, ??) + 0x57c 0x00000001018fc4fc kcvcfsi(??, ??, ??, ??, ??, ??) + 0x31c 0x00000001018e1744 kcvucpb(??, ??, ??, ??, ??, ??, ??, ??) + 0x1a4 0x00000001018fa344 kcvcca(??, ??) + 0x104 0x000000010015e728 ksbcti(??, ??, ??) + 0x3c8 0x00000001001694fc ksbabs(??) + 0x25c 0x0000000100166554 ksbrdp() + 0x4b4 0x000000010430c09c opirip(??, ??, ??) + 0x3fc 0x0000000102d9a598 opidrv(??, ??, ??) + 0x458 0x000000010370b0b0 sou2o(??, ??, ??, ??) + 0x90 0x0000000100000870 opimai_real(??, ??) + 0x150 0x00000001000006d8 main(??, ??) + 0x98 0x0000000100000368 __start() + 0x98 *** 2012-03-15 22:59:40.936 ---------------------------------------- SO: 70000024c3b20e0, type: 2, owner: 0, flag: INIT/-/-/0x00 (process) Oracle pid=7, calls cur/top: 70000024f5da8f0/70000024f5da8f0, flag: (6) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 114 last post received-location: kcbbza last process to post me: 70000024c3b1900 1 6 last post sent: 0 0 9 last post sent-location: ksqrcl last process posted by me: 70000024c3b9700 106 2 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 70000024c42cbc0 O/S info: user: oracle, term: UNKNOWN, ospid: 524594 OSD pid info: Unix process pid: 524594, image: oracle@lgcenjfsa (CKPT) SO: 70000024f5934a0, type: 4, owner: 70000024c3b20e0, flag: INIT/-/-/0x00 (session) sid: 550 trans: 0, creator: 70000024c3b20e0, flag: (51) USR/- BSY/-/-/-/-/- DID: 0001-0007-00000005, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS last wait for 'direct path write' blocking sess=0x0 seq=6421 wait_time=1 seconds since wait started=0 file number=33, first dba=1, block cnt=1 Dumping Session Wait History for 'direct path write' count=1 wait_time=1 file number=33, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=1 file number=34, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=0 file number=35, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=0 file number=36, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=0 file number=37, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=1 file number=38, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=0 file number=39, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=0 file number=3a, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=1 file number=3b, first dba=1, block cnt=1 for 'direct path write' count=1 wait_time=1 file number=3c, first dba=1, block cnt=1 service name: SYS$BACKGROUND waiting for 'rdbms ipc message' blocking sess=0x0 seq=6432 wait_time=0 seconds since wait started=0 timeout=11c, =0, =0 Dumping Session Wait History for 'rdbms ipc message' count=1 wait_time=52 timeout=11d, =0, =0 for 'rdbms ipc message' count=1 wait_time=150428 timeout=12c, =0, =0 for 'log buffer space' count=1 wait_time=4 =0, =0, =0 for 'control file sequential read' count=1 wait_time=18 file#=0, block#=3, blocks=1 for 'control file parallel write' count=1 wait_time=32552 files=3, block#=3, requests=3 for 'control file sequential read' count=1 wait_time=13 file#=0, block#=1, blocks=1 for 'control file parallel write' count=1 wait_time=14208 files=3, block#=1, requests=3 for 'control file parallel write' count=1 wait_time=16303 files=3, block#=f, requests=3 for 'control file parallel write' count=1 wait_time=13450 files=3, block#=11, requests=3 for 'control file parallel write' count=1 wait_time=22983 files=3, block#=1a, requests=3 ---------------------------------------- SO: 70000024f682f80, type: 5, owner: 70000024f5da8f0, flag: INIT/-/-/0x00 (enqueue) CF-00000000-00000000 DID: 0000-0007-00000004 lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2 res: 0x70000024c605120, mode: SS, lock_flag: 0x0 own: 0x70000024f5934a0, sess: 0x70000024f5934a0, proc: 0x70000024c3b20e0, prv: 0x70000024c605130 ---------------------------------------- |
1 |
我们接着来看看arc1 进程的信息: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 |
PROCESS 17: ---------------------------------------- SO: 70000024c3b4840, type: 2, owner: 0, flag: INIT/-/-/0x00 (process) Oracle pid=17, calls cur/top: 70000024f5dc778/70000024f5dc778, flag: (2) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 9 last post received-location: ksqrcl last process to post me: 70000024f3ab6a0 1 2 last post sent: 0 0 9 last post sent-location: ksqrcl last process posted by me: 70000024f3ab6a0 1 2 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 70000024c42cbc0 O/S info: user: oracle, term: UNKNOWN, ospid: 299280 OSD pid info: Unix process pid: 299280, image: oracle@lgcenjfsa (ARC1) SO: 70000024c59e2f8, type: 4, owner: 70000024c3b4840, flag: INIT/-/-/0x00 (session) sid: 541 trans: 0, creator: 70000024c3b4840, flag: (51) USR/- BSY/-/-/-/-/- DID: 0001-0011-00000002, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS service name: SYS$BACKGROUND last wait for 'enq: CF - contention' blocking sess=0x0 seq=28186 wait_time=2929713 seconds since wait started=903 name|mode=43460004, 0=0, operation=0 Dumping Session Wait History for 'enq: CF - contention' count=1 wait_time=2929713 <== 这里的waitime其实是个累积值。 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929711 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929705 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929707 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929709 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929714 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929711 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929712 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929706 name|mode=43460004, 0=0, operation=0 for 'enq: CF - contention' count=1 wait_time=2929711 name|mode=43460004, 0=0, operation=0 ---------------------------------------- SO: 70000024f5dc778, type: 3, owner: 70000024c3b4840, flag: INIT/-/-/0x00 (call) sess: cur 70000024c59e2f8, rec 70000024c56b588, usr 70000024c59e2f8; depth: 0 ---------------------------------------- SO: 70000024f698a10, type: 5, owner: 70000024f5dc778, flag: INIT/-/-/0x00 (enqueue) CF-00000000-00000000 DID: 0001-0011-00000002 lv: 00 60 48 5e 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2 res: 0x70000024c605120, lock_flag: 0x0 own: 0x70000024c59e2f8, sess: 0x70000024c59e2f8, prv: 0x70000024f698a40 |
1 2 3 |
这里我们可以看到 0x70000024c605120 正是ckpt持有的resouce。 最后我们来看看lgwr进程信息: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 |
PROCESS 6: ---------------------------------------- SO: 70000024f3a8f40, type: 2, owner: 0, flag: INIT/-/-/0x00 (process) Oracle pid=6, calls cur/top: 70000024f5da630/70000024f5da630, flag: (6) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 24 last post received-location: ksasnd last process to post me: 70000024c3b1900 1 6 last post sent: 109 0 4 last post sent-location: kslpsr last process posted by me: 70000024f3af5a0 158 0 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 70000024c42cbc0 O/S info: user: oracle, term: UNKNOWN, ospid: 413940 OSD pid info: Unix process pid: 413940, image: oracle@lgcenjfsa (LGWR) ---------------------------------------- SO: 70000024f6831e0, type: 5, owner: 70000024f3a8f40, flag: INIT/-/-/0x00 (enqueue) RT-00000001-00000000 DID: 0000-0006-00000005 lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x2 res: 0x70000024c605450, mode: X, lock_flag: 0x0 own: 0x70000024c5a4e00, sess: 0x70000024c5a4e00, proc: 0x70000024f3a8f40, prv: 0x70000024c605460 (FOB) flags=2 fib=7000002487ef290 incno=0 pending i/o cnt=0 fname=/data04/oradata/cn3fsa/ctlcn3fsa3.ctl fno=2 lblksz=16384 fsiz=1204 (FOB) flags=2 fib=7000002487eeef0 incno=0 pending i/o cnt=0 fname=/data03/oradata/cn3fsa/ctlcn3fsa2.ctl fno=1 lblksz=16384 fsiz=1204 (FOB) flags=2 fib=7000002487eeb50 incno=0 pending i/o cnt=0 fname=/data02/oradata/cn3fsa/ctlcn3fsa1.ctl fno=0 lblksz=16384 fsiz=1204 ---------------------------------------- SO: 70000024c6c2338, type: 11, owner: 70000024f3a8f40, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: 70000024f3a8f40, event: 5, last message event: 13, last message waited event: 13, messages read: 1 channel: (70000024c6e7790) scumnt mount lock scope: 1, event: 15, last mesage event: 13, publishers/subscribers: 0/13, messages published: 1 ---------------------------------------- SO: 70000024c5a4e00, type: 4, owner: 70000024f3a8f40, flag: INIT/-/-/0x00 (session) sid: 551 trans: 0, creator: 70000024f3a8f40, flag: (51) USR/- BSY/-/-/-/-/- DID: 0001-0006-00000006, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS service name: SYS$BACKGROUND waiting for 'rdbms ipc message' blocking sess=0x0 seq=36384 wait_time=0 seconds since wait started=0 timeout=129, =0, =0 Dumping Session Wait History for 'control file sequential read' count=1 wait_time=4 file#=0, block#=16, blocks=1 for 'control file sequential read' count=1 wait_time=4 file#=0, block#=13, blocks=1 for 'control file sequential read' count=1 wait_time=4 file#=0, block#=12, blocks=1 for 'control file sequential read' count=1 wait_time=4 file#=0, block#=10, blocks=1 for 'control file sequential read' count=1 wait_time=5 file#=2, block#=1, blocks=1 for 'control file sequential read' count=1 wait_time=5 file#=1, block#=1, blocks=1 for 'control file sequential read' count=1 wait_time=8 file#=0, block#=1, blocks=1 for 'rdbms ipc message' count=1 wait_time=26490 timeout=12c, =0, =0 for 'control file sequential read' count=1 wait_time=12 file#=0, block#=16, blocks=1 for 'control file sequential read' count=1 wait_time=4 file#=0, block#=13, blocks=1 temporary object counter: 0 |
1 |
从trace前面的锁信息得出如下结果: |
1 2 3 4 5 |
Unable to get enqueue on resource CF-00000000-00000000 (ges mode req=3 held=6) Possible local blocker ospid=524594 sid=550 sser=1 time_held=674 secs (ges mode req=6 held=4) ckpt: hold 4, request 6 arch1: request 3,hold 6 |
1 2 3 4 5 6 7 8 9 10 11 12 13 |
综上所述,得出如下结论: 1. arch1 等待 ckpt (为了获得CF enqueue); 2. lgwr 在等待 arch1 去写日志; 3. arch1 等待时间超过 900s (从上面来看其实仅仅才674s); 4. lgwr 将进程 ckpt kill (因为ckpt进程阻塞arch进程,而lgwr通知arch1进程写归档,那么就间接是ckpt阻塞了lgwr进程); 5. database crash。 从alert来看,是因为log file切换有点过于频繁了,时间短的就30s左右。 根据oracle mos文档Database Crashes With ORA-00494 [ID 753290.1]的建议是这样的: 1. 调整redo log file,报错切换频率维持在 20~30 min; 2. 对于online logfile大小,可以参考 v$instance_recovery.optimal_logfile_size. |
3 Responses to “database crash with ora-00494”
RightScale’s management tool on any platform is fantastic. The server templates really make deployments easy and manageable.
[…] http://www.killdb.com/2012/03/17/database-crash-with-ora-00494.html […]
For test!
Leave a Reply
You must be logged in to post a comment.