某客户RAC由于掉电导致系统崩溃的恢复过程
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 某客户RAC由于掉电导致系统崩溃的恢复过程
这里简单记录一下,此次国庆加班恢复的某客户的2套Oracle RAC数据库,整个恢复过程中,2套rac差不多,因此这里以其中一套数据库的恢复过程为例进行简单分析说明。数据库由于为非归档,由于掉电导致重启之后系统无法正常open。
在正常open的过程中,报错如下错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
SQL> alter database open; alter database open * ERROR at line 1: ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [], [], [] SQL> shutdown immediate ORA-01109: database not open Database dismounted. ORACLE instance shut down. SQL> startup ORACLE instance started. Total System Global Area 4294967296 bytes Fixed Size 2089576 bytes Variable Size 2751466904 bytes Database Buffers 1526726656 bytes Redo Buffers 14684160 bytes Database mounted. ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [], [], [] |
对于该错误,网上的解决方法也很多,可惜都不管用。这种情况之下,往往都是需要强制打开数据库的,首先需要做一个不完全恢复,如下:
1 2 3 4 5 6 7 8 9 10 11 12 |
SQL> recover database ORA-00279: change 236912204 generated at 09/29/2015 12:49:13 needed for thread 1 ORA-00289: suggestion : /xxxx/1_5112_877094801.dbf ORA-00280: change 236912204 for thread 1 is in sequence #5112 Specify log: {<RET>=suggested | filename | AUTO | CANCEL} auto ORA-00308: cannot open archived log '/xxxx/1_5112_877094801.dbf' ORA-27037: unable to obtain file status IBM AIX RISC System/6000 Error: 2: No such file or directory Additional information: 3 |
在进行相关操作之后,我备份了一下当前的控制文件信息,便于后面如果有问题,方便处理。强制open的过程中,发现报如下错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
Sat Oct 3 11:49:31 2015 Instance recovery: looking for dead threads Instance recovery: lock domain invalid but no dead threads Sat Oct 3 11:49:33 2015 Errors in file /oracle/oracle/admin/cwdb/udump/cwdb1_ora_6029586.trc: ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], [] Sat Oct 3 11:49:34 2015 Errors in file /oracle/oracle/admin/xxxx/udump/xxxx1_ora_6029586.trc: ORA-00600: internal error code, arguments: [kclchkblk_4], [1], [18446744072394632417], [1], [18446744072392296306], [], [], [] Sat Oct 3 11:49:34 2015 Error 600 happened during db open, shutting down database USER: terminating instance due to error 600 Instance terminated by USER, pid = 6029586 |
这个错误已经处理过多次了。同样,百度一下,会发现很多人都写过相关的文章,包括Oracle mos的文章解释也是说这是临时块的scn过大导致,通过drop tempfile即可绕过该问题。实际上,这种情况之下,根本不会起作用。
但是不管如何,这个问题很明显都是跟block的scn有关系。既然是跟scn有关系,那么处理就不难了,通过推进scn即可。
通过推进scn 之后,再次open resetlogs成功打开数据库,可惜的是alert log报了一堆错误,如下所示:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
Sat Oct 3 13:10:34 2015 Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc: ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], [] Opening with internal Resource Manager plan where NUMA PG = 1, CPUs = 40 Sat Oct 3 13:10:35 2015 ORACLE Instance xxxx1 (pid = 25) - Error 600 encountered while recovering transaction (23, 85). Sat Oct 3 13:10:35 2015 Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc: ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], [] Sat Oct 3 13:10:35 2015 Trace dumping is performing id=[cdmp_20151003131035] Sat Oct 3 13:10:35 2015 replication_dependency_tracking turned off (no async multimaster replication found) Sat Oct 3 13:10:36 2015 Instance recovery: looking for dead threads Instance recovery: lock domain invalid but no dead threads Sat Oct 3 13:10:36 2015 Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc: ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], [] Sat Oct 3 13:10:37 2015 Starting background process QMNC Sat Oct 3 13:10:37 2015 ORACLE Instance xxxx1 (pid = 25) - Error 600 encountered while recovering transaction (23, 85). Sat Oct 3 13:10:37 2015 Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_smon_10420246.trc: ORA-00600: internal error code, arguments: [4137], [], [], [], [], [], [], [] QMNC started with pid=53, OS id=7536816 Sat Oct 3 13:10:41 2015 LOGSTDBY: Validating controlfile with logical metadata Sat Oct 3 13:10:41 2015 LOGSTDBY: Validation complete Sat Oct 3 13:10:46 2015 Errors in file /oracle/oracle/admin/xxxx/bdump/xxxx1_mmon_9110004.trc: ORA-00600: internal error code, arguments: [qertbFetchByRowID], [], [], [], [], [], [], [] Sat Oct 3 13:10:48 2015 Errors in file /oracle/oracle/admin/xxxx/udump/xxxx1_ora_6619434.trc: ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [] Completed: alter database open resetlogs |
这部分错误处理其实都不难。对于第一个ora-00600 [4137] 错误,很明显这是跟undo有关系的,其中(23,85)中的23表现第23号回滚段;通过屏蔽第23号回滚段可以很容易解决该错误,当然,这会儿导致事务的不一致性,这是没办法的,已经undo异常,Oracle 已经没有办法进行正常的事务恢复了。
其次,对于第2个ora-00600 [qertbFetchByRowID] 错误,处理也很简单,其大致意思是通过rowid访问获取数据有异常,很明显这是跟index有关系,通过重建index 可以解决该问题,其次最后一个[kdsgrp1] 错误就更常见了,通常也是Index的问题,重建即可。
1 |
看上去一切的恢复过程都很简单,很顺利,然而这里真正的难题,真正的问题才开始。 |
1 |
也就是最后一个看似很简单的错误ora-00600 [kdsgrp1]错误,对我们产生了极大的困难。首先我们来看下产生该错误时涉及到那些对象: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
<pre class="brush:plain">Validate domain 0 Validated domain 0, flags = 0x0 kwqmnich: current time:: 13: 31: 34 kwqmnich: instance no 0 check_only flag 1 kwqmnich: initialized job cache structure row 0041edda.2e continuation at file# 1 block# 126426 slot 47 not found ************************************************** KDSTABN_GET: 0 ..... ntab: 1 curSlot: 47 ..... nrows: 175 ************************************************** *** 2015-10-03 13:31:40.864 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [] Current SQL statement for this session: SELECT OWNER,NAME,TYPE,COUNT(*) FROM DBA_SOURCE WHERE SUBSTR(OWNER,1,4)='FMIS' GROUP BY OWNER,NAME,TYPE HAVING COUNT(*)>1000 ----- PL/SQL Call Stack ----- Object id on Block? Y seg/obj: 0x12 csc: 0x01.b1957474 itc: 3 flg: O typ: 1 - DATA fsl: 0 fnx: 0x0 ver: 0x01 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x0009.00e.0003968d 0x0c07f50f.1ea7.5d --U- 1 fsc 0x0053.b1957475 0x02 0x0017.005.0006755c 0x0c0774de.1be3.38 C--- 0 scn 0x0001.b1957451 0x03 0x0002.042.00010d39 0x0080b268.0cf1.22 C--- 0 scn 0x0001.b1957429 data_block_dump,data header at 0x110f46074 =============== tsiz: 0x3f88 hsiz: 0x170 pbl: 0x110f46074 bdba: 0x0041edda 76543210 flag=-------- ntab=1 nrow=175 frre=0 fsbo=0x170 fseo=0x783 avsp=0x3cbe tosp=0x3d13 0xe:pti[0] nrow=175 offs=0 0x12:pri[0] sfll=1 0x14:pri[1] sfll=2 |
1 |
我们可以发现,除开其他的非核心对象之后,这里还涉及到一个obj#=18,也就是obj$ 这个核心的数据字典表。而该数据字典表上的几个Index, |
1 |
i_obj1,i_obj2,i_obj3 都是object_id 小于57的核心对象,这部分对象是属于bootstrap$ 的核心数据字典对象。即是Index也无法通过 |
1 |
rebuild,38003 event或在upgrade模式下进行重建。 |
1 |
当然,这里也不是说完全无法去重建上述数据字典表,我后面有一篇文章会相信讲解如何去重建。 |
1 |
在分析过程中,我发现其中的前面2个Index都有问题,如下: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
<pre class="brush:plain">SQL> analyze table obj$ validate structure; Table analyzed. SQL> select index_name from dba_indexes where table_name='OBJ$'; INDEX_NAME ------------------------------ I_OBJ1 I_OBJ2 I_OBJ3 SQL> analyze index I_OBJ1 validate structure; analyze index I_OBJ1 validate structure * ERROR at line 1: ORA-08100: index is not valid - see trace file for diagnostics SQL> analyze index I_OBJ2 validate structure; analyze index I_OBJ2 validate structure * ERROR at line 1: ORA-08100: index is not valid - see trace file for diagnostics SQL> analyze index I_OBJ3 validate structure; Index analyzed. |
1 |
不过这里我们也要注意的时,虽然前面2个index都有问题,然而上述错误产生时涉及到的index并不是2个都用到了,其实只是用到了第一个index就 |
1 |
报错ora-00600错误了。由于客户想通过expdp schema的方式去导出数据,然而发现执行时报错ora-00600 [kdsgrp1],包括exp执行时也报该 |
1 |
错误,不过exp tables的方式,不会报错;由于对象太多,将近50万个对象(包括表,index以及其他)。很明显,只能通过用户级别的导出。 |
1 |
那么也就在意味着我们必须修复这个错误才。 |
1 |
通过dump 相关的block,我们发现错误是很奇怪的,如下: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 |
<pre class="brush:plain">*** SESSION ID:(1052.1243) 2015-10-03 17:56:59.784 Block Checking: DBA = 4684328, Block Type = KTB-managed data block *** previous block dba dba: 477a2adoes not match my previous block dba dba: 477a26 **** row 0: key out of order ---- end index block validation ---- for block 4684328 = 0x00477a28 Block header dump: 0x00477a28 Object id on Block? Y seg/obj: 0x24 csc: 0x02.80026908 itc: 19 flg: O typ: 2 - INDEX fsl: 0 fnx: 0x477a1c ver: 0x01 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x0015.01f.00020a0a 0x00867c38.0862.04 C--- 0 scn 0x0001.83f32656 0x02 0x002e.044.00002415 0x0082989a.0477.52 C--- 0 scn 0x0001.74d90109 0x03 0x002c.00a.000012c5 0x0b804b78.012b.70 C--- 0 scn 0x0001.74d9011c 0x04 0x0002.001.00003a72 0x00861aab.024b.0a C--- 0 scn 0x0001.74fa8cf1 0x05 0x002f.02e.000065eb 0x0b816866.0217.5b C--- 0 scn 0x0001.74fb0205 0x06 0x0026.011.00002000 0x0b814f02.016d.14 C--- 0 scn 0x0001.74fc3dbf 0x07 0x0006.011.00003b1c 0x0084c65b.01f9.10 C--- 0 scn 0x0001.74fde478 0x08 0x0029.05f.00003258 0x0084faa6.017c.2f C--- 0 scn 0x0001.74fdedad 0x09 0x001b.023.00003fe6 0x008a8718.025b.3b C--- 0 scn 0x0001.74fdfedc 0x0a 0x001b.006.00004057 0x0b804f24.025c.33 C--- 0 scn 0x0001.751a25ed 0x0b 0x001b.02a.00004039 0x0b804f24.025c.49 C--- 0 scn 0x0001.751a2609 0x0c 0x001b.024.0000404f 0x0b804f24.025c.63 C--- 0 scn 0x0001.751a2627 0x0d 0x000a.018.00021b0b 0x0b821c8b.04cc.45 C--- 0 scn 0x0001.751a263f 0x0e 0x000a.055.00021b0a 0x0b821c8b.04cc.5e C--- 0 scn 0x0001.751a265c 0x0f 0x000a.031.00021aec 0x0b821c8c.04cc.09 C--- 0 scn 0x0001.751a2678 0x10 0x0022.05e.00001dbd 0x008a88df.01e0.2e C--- 0 scn 0x0001.74d8ff09 0x11 0x0025.010.00001ead 0x00860446.01ac.20 C--- 0 scn 0x0001.74d8ff7d 0x12 0x002b.043.00001228 0x0084dc65.0164.11 C--- 0 scn 0x0001.74d8ffe2 0x13 0x001f.015.00001de8 0x0b81a76a.0183.4e C--- 0 scn 0x0001.74d90071 Leaf block dump =============== header address 504403185228956148=0x70000064725c1f4 kdxcolev 0 KDXCOLEV Flags = - - - kdxcolok 0 kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y kdxconco 1 kdxcosdc 2 kdxconro 0 kdxcofbo 36=0x24 kdxcofeo 15816=0x3dc8 kdxcoavs 15780 kdxlespl 0 kdxlende 0 kdxlenxt 4684330=0x477a2a kdxleprv 4684326=0x477a26 kdxledsz 6 kdxlebksz 15816 *** dummy key *** row#0[9549] flag: ---D--, lock: 0, len=14, data:(6): 00 44 03 10 00 7c col 0; len 5; (5): c4 04 24 25 2f ----- end of leaf block dump ----- dumping parent of corrupted subtree, row # = 1202 dumping parent of corrupted subtree, row # = 3 |
1 |
对于上述这个index 的错误,我是第一次遇见,跟老熊讨论了一下,他认为可能是index split情况之下出现的。在远程时,我也用过bbed |
1 |
对前后将近10个index block进行了分析,通过比较index 的链表,发现确实不匹配。 |
1 |
对于这种情况之下,想通过bbed去修复 index,难度可想而知,因此果断放弃这种方式。最后无奈之下,只能通过处理数据字典表 |
1 |
的方式来处理掉i_obj1,i_obj2 这2个index。最后再让客户进行exp 用户级别的导出,只不过这个导出的时间比较漫长了。 |
2 Responses to “某客户RAC由于掉电导致系统崩溃的恢复过程”
recover database的时候没有选择online redo吗?
客户之前已经recover了N次了。。
Leave a Reply
You must be logged in to post a comment.