数据库重启之后无法open的恢复案例
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 数据库重启之后无法open的恢复案例
这是某网友的维护的一套数据库,据说是正常重启之后就无法启动数据库了。那么我们先来看看日志是什么样的:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
Errors in file /u01/app/oracle/admin/orcl/bdump/orcl1_p012_18165.trc: ORA-27090: Message 27090 not found; product=RDBMS; facility=ORA Linux-x86_64 Error: 4: Interrupted system call Additional information: 3 Additional information: 128 Additional information: 65536 ..... Errors in file /u01/app/oracle/admin/orcl/bdump/orcl1_p007_18153.trc: ORA-27090: Message 27090 not found; product=RDBMS; facility=ORA Linux-x86_64 Error: 4: Interrupted system call Additional information: 3 Additional information: 128 Additional information: 65536 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 ..... SMON: enabling cache recovery Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_8858.trc: ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [] ..... Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_8858.trc: ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [] ..... Error 704 happened during db open, shutting down database USER: terminating instance due to error 704 |
我们可以看到,节点1在9:48:52秒被强行终止重启了实例。而且我们还可以看出该节点从9:42开始就出现ORA-27090 错误。而该错误通常跟操作系统有关系,通过后面的Linux-x86_64 Error: 4: Interrupted system call 错误也验证了这一点。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
Thread 2 advanced to log sequence 334685 (LGWR switch) Current log# 4 seq# 334685 mem# 0: +DATA/orcl/onlinelog/group_4.log ..... Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 ..... Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_mmon_9401.trc: ORA-07445: Message 7445 not found; No message file for product=RDBMS, facility=ORA; arguments: [kgghteFindCB()+188] [SIGSEGV] [Address not mapped to object] [0x00000010B] ..... Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9475.trc: ORA-07445: exception encountered: core dump [kglsget()+490] [SIGSEGV] [Address not mapped to object] [0x000000008] [] [] ..... Error 0 in kwqmnpartition(), aborting txn ..... Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9943.trc: ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], [] ..... ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], [] ..... ORA-65535 encountered when generating server alert SMG-4131 ..... Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_9943.trc: ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], [] ..... Reconfiguration started (old inc 2, new inc 4) List of nodes: 0 1 Global Resource Directory frozen Communication channels reestablished * domain 0 valid = 1 according to instance 0 ...... Completed redo application ..... Errors in file /u01/app/oracle/admin/orcl/udump/orcl2_ora_10551.trc: ORA-00600: internal error code, arguments: [kggfaAllocFunc1], [], [], [], [], [], [], [] ..... Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_smon_9395.trc: ORA-00600: Message 600 not found; No message file for product=RDBMS, facility=ORA; arguments: [16659] [kqldtu] [D] [0] [65] ..... Redo Shipping Client Connected as PUBLIC -- Connected User is Valid ...... Non-fatal internal error happenned while SMON was doing cursor transient type cleanup. SMON encountered 1 out of maximum 100 non-fatal internal errors. ..... Trace dumping is performing id=[cdmp_20160802094934] ..... Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_smon_9395.trc: ORA-00600: Message 600 not found; No message file for product=RDBMS, facility=ORA; arguments: [16659] [kqldtu] [D] [0] [195] ..... Non-fatal internal error happenned while SMON was doing cursor transient type cleanup. SMON encountered 2 out of maximum 100 non-fatal internal errors. ...... Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_pmon_9349.trc: ORA-00474: Message 474 not found; No message file for product=RDBMS, facility=ORA ...... PMON: terminating instance due to error 474 ...... Errors in file /u01/app/oracle/admin/orcl/bdump/orcl2_lmon_9355.trc: ORA-00474: Message 474 not found; No message file for product=RDBMS, facility=ORA ..... System state dump is made for local instance System State dumped to trace file /u01/app/oracle/admin/orcl/bdump/orcl2_diag_9351.trc ...... Shutting down instance (abort) License high water mark = 72 ...... Trace dumping is performing id=[cdmp_20160802095002] ..... Instance terminated by PMON, pid = 9349 ..... Instance terminated by USER, pid = 11027 ..... Starting ORACLE instance (normal) |
这里我们无论是看节点1还是节点2的alert log日志都会发现,由于smon进程在进程事务恢复时失败之后,导致数据库实例最终宕掉。宕掉之后就再也无法正常启动了。很明显这是强行关库之后带来的蝴蝶效应。
这里我们来看看其中节点2的这个ORA-00600 [16559]是什么含义?
1 2 3 4 |
ERROR: ORA-600 [16659] [a] [b] [c] [d] [e] VERSIONS: versions 8.0 to 8.1 DESCRIPTION: We are attempting to update a tab$ row and fail to update the dictionary information correctly. FUNCTIONALITY: DATA DICTIONARY IMPACT: PROCESS FAILURE POSSIBLE DATA DICTIONARY INCONSISTENCY |
从解释来看,这是Oracle 数据字典表tab$出现了不一致的情况。比较郁闷的是,客户的dataguard也坏掉了,也是一样的错误。
那么看来只能进行恢复了。这里首先要明白,节点1的ora-00600 [16703]本质上来讲跟ora-00600 [16559]是一回事。
从具体的错误来看,Oracle在open时,进行bootstrap初始化的过程就失败了,因此报错ORA-00704: bootstrap process failure.
处理思路也很简单,我们首先通过10046 trace跟踪open的过程,来看看Oracle 在bootstrap初始化的时候在进行什么操作时报错的?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
PARSING IN CURSOR #2 len=106 dep=1 uid=0 oct=3 lid=0 tim=1435661277781458 hv=3628073639 ad='cfd74588' select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1 END OF STMT PARSE #2:c=999,e=676,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1435661277781453 ===================== ..... BINDS #2: kkscoacd Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0 kxsbbbfp=2b5f7bae1b00 bln=22 avl=02 flg=05 value=20 EXEC #2:c=999,e=1015,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1435661277798049 WAIT #2: nam='db file sequential read' ela= 1086 file#=1 block#=50 blocks=1 obj#=-1 tim=1435661277799253 WAIT #2: nam='db file sequential read' ela= 568 file#=1 block#=51 blocks=1 obj#=-1 tim=1435661277799931 WAIT #2: nam='db file sequential read' ela= 1804 file#=1 block#=26 blocks=1 obj#=-1 tim=1435661277801863 FETCH #2:c=0,e=3833,p=3,cr=3,cu=0,mis=0,r=0,dep=1,og=4,tim=1435661277801922 *** 2016-08-02 13:52:28.469 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [] Current SQL statement for this session: alter database open |
从上面的错误不难看出就是在访问tab$ 的时候报错的,而且是访问的obj#=20的这个对象。那么这个对象是什么呢?
1 2 3 4 5 6 7 8 9 10 |
SQL> select owner,object_name,object_type from dba_objects where object_id=20; OWNER OBJECT_NAME OBJECT_TYPE ------------------------------ ------------------------------ ------------------- SYS ICOL$ TABLE SQL> !oerr ora 1403 01403, 00000, "no data found" // *Cause: // *Action: |
根据我们的查询以及对ORA-00600 [16703],[1403],[20] 这个错误的理解,那么我这里可以大致判断这个错误后的几个数字的含义:
16703: 错误代码,表示数据字典基表存在不一致1403: 表示数据没找到或者不匹配,即not data found.20: 表示访问的对象号,即object_id.
同时我们从前面的10046 trace跟踪来看,报错的SQL语句访问了3个block,然后报错,分别是file 1 block 50,51,26。
这我们分别dump 上面的3个block发现其中block 51,26 的dump 内容如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 |
---block 51 Object id on Block? Y seg/obj: 0x3 csc: 0x00.64205 itc: 2 flg: - typ: 2 - INDEX fsl: 0 fnx: 0x0 ver: 0x01 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x0001.008.0000000d 0x0080118a.0010.01 CB-- 0 scn 0x0000.000105c7 0x02 0x000a.01e.000005e0 0x008074fc.007a.36 --U- 1 fsc 0x0000.00064208 Leaf block dump =============== header address 185777756=0xb12be5c kdxcolev 0 KDXCOLEV Flags = - - - kdxcolok 0 kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y kdxconco 1 kdxcosdc 2 kdxconro 258 kdxcofbo 552=0x228 kdxcofeo 4484=0x1184 kdxcoavs 3932 kdxlespl 0 kdxlende 0 kdxlenxt 4194360=0x400038 kdxleprv 0=0x0 kdxledsz 8 kdxlebksz 8032 ... row#93[5724] flag: ------, lock: 2, len=14, data:(8): 00 40 38 8e 00 06 02 00 col 0; len 3; (3): c2 03 03 ...... ---block 26 Block header dump: 0x0040001a Object id on Block? Y seg/obj: 0x2 csc: 0x05.e0568950 itc: 2 flg: - typ: 1 - DATA fsl: 0 fnx: 0x0 ver: 0x01 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x0000.05f.00000002 0x00400012.0001.16 C--- 0 scn 0x0000.000000fd 0x02 0x000b.00d.00048164 0x0100982b.b8bd.4d --U- 7 fsc 0x039f.e0568c3e data_block_dump,data header at 0xb12be5c =============== tsiz: 0x1fa0 hsiz: 0x140 pbl: 0x0b12be5c bdba: 0x0040001a 76543210 flag=-------- ntab=6 nrow=141 frre=-1 fsbo=0x140 fseo=0x160 avsp=0x20 tosp=0x3db 0xe:pti[0] nrow=8 offs=0 0x12:pti[1] nrow=7 offs=8 0x16:pti[2] nrow=1 offs=15 0x1a:pti[3] nrow=10 offs=16 0x1e:pti[4] nrow=15 offs=26 0x22:pti[5] nrow=100 offs=41 ..... col 0: [ 2] c1 14 tab 1, row 0, @0x1e65 tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 0 tab 1, row 1, @0x1ddb tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 1 tab 1, row 2, @0x1d51 tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 2 tab 1, row 3, @0x1ccd tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 4 tab 1, row 4, @0x1c45 tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 5 tab 1, row 5, @0x1bc0 tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 6 tab 1, row 6, @0x1b35 tl: 4 fb: -CHDFL-- lb: 0x2 cc: 0 cki: 7 tab 2, row 0, @0x160 tl: 48 fb: -CH-FL-- lb: 0x0 cc: 19 cki: 3 col 0: [ 2] c1 1e col 1: [ 1] 80 ..... col 17: *NULL* col 18: [ 1] 80 tab 3, row 0, @0x1ad0 ..... |
看到这里,我就想是否可以通过bbed先把这2个block 给修复了,看看是否能够起来。如下是简单的修复过程:
对于51号block 由于是Index 修改非常简单,这里不多说。26号block 是cluster table,这个相对复杂的多。首先提交事务、修改lock flag之后verify还是报错,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
BBED> verify DBVERIFY - Verification starting FILE = /u01/fix/SYSTEM_OLD.dbf BLOCK = 26 Block Checking: DBA = 4194330, Block Type = KTB-managed data block data header at 0x105d485c kdbchk: key comref count wrong keyslot=7 Block 26 failed with check code 6121 DBVERIFY - Verification complete Total Blocks Examined : 1 Total Blocks Processed (Data) : 1 Total Blocks Failing (Data) : 1 Total Blocks Processed (Index): 0 Total Blocks Failing (Index): 0 Total Blocks Empty : 0 Total Blocks Marked Corrupt : 0 Total Blocks Influx : 0 |
这里继续修改聚簇对应的kdbr信息(这里以其中一个kdbr为例):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
BBED> p *kdbr[7] rowdata[7568] ------------- ub1 rowdata[7568] @8012 0xac BBED> x /rcccccc rowdata[7568] @8012 ------------- flag@8012: 0xac (KDRHFL, KDRHFF, KDRHFH, KDRHFK) lock@8013: 0x00 cols@8014: 1 kref@8015: 31 mref@8017: 30 hrid@8019:0x0040001b.0 nrid@8025:0x0040001b.0 BBED> modify /x 1f offset 8017 File: /u01/fix/SYSTEM_OLD.dbf (1) Block: 26 Offsets: 8017 to 8020 Dba:0x0040001a ------------------------------------------------------------------------ 1f000040 <32 bytes per line BBED> verify DBVERIFY - Verification starting FILE = /u01/fix/SYSTEM_OLD.dbf BLOCK = 26 Block Checking: DBA = 4194330, Block Type = KTB-managed data block data header at 0x105d485c kdbchk: space available on commit is incorrect tosp=987 fsc=0 stb=0 avsp=32 Block 26 failed with check code 6111 DBVERIFY - Verification complete Total Blocks Examined : 1 Total Blocks Processed (Data) : 1 Total Blocks Failing (Data) : 1 Total Blocks Processed (Index): 0 Total Blocks Failing (Index): 0 Total Blocks Empty : 0 Total Blocks Marked Corrupt : 0 Total Blocks Influx : 0 BBED> p kdbh struct kdbh, 14 bytes @92 ub1 kdbhflag @92 0x00 (NONE) b1 kdbhntab @93 6 b2 kdbhnrow @94 141 sb2 kdbhfrre @96 -1 sb2 kdbhfsbo @98 320 sb2 kdbhfseo @100 352 b2 kdbhavsp @102 32 b2 kdbhtosp @104 987 BBED> d /v offset 104 count 2 File: /u01/fix/SYSTEM_OLD.dbf (1) Block: 26 Offsets: 104 to 105 Dba:0x0040001a ------------------------------------------------------- db03 l .. <16 bytes per line> BBED> modify /x 2000 offset 104 File: /u01/fix/SYSTEM_OLD.dbf (1) Block: 26 Offsets: 104 to 105 Dba:0x0040001a ------------------------------------------------------------------------ 2000 <32 bytes per line> BBED> sum apply Check value for File 1, Block 26: current = 0x87ce, required = 0x87ce BBED> verify DBVERIFY - Verification starting FILE = /u01/fix/SYSTEM_OLD.dbf BLOCK = 26 DBVERIFY - Verification complete Total Blocks Examined : 1 Total Blocks Processed (Data) : 1 Total Blocks Failing (Data) : 0 Total Blocks Processed (Index): 0 Total Blocks Failing (Index): 0 Total Blocks Empty : 0 Total Blocks Marked Corrupt : 0 Total Blocks Influx : 0 |
我们经过几处简单修改之后,再次verify校验已经不再报错了;不过再次open数据库时,发现报另外一个错误了:
1 2 3 4 |
Errors in file /u01/app/oracle/admin/orcl/udump/orcl1_ora_18955.trc: ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00600: internal error code, arguments: [kdoirp-3], [139], [0], [], [], [], [], [] |
从错误来看,bootstrap的初始化过程仍然有问题。通过10046 trace跟踪发现还是那几个block。
回想前面这个block的dump时,看到的几行操作是delete,如下:tl: 4 fb: -CHDFL– lb: 0x2 cc: 0 cki: 0
那么我们这里试做将这几个被删除的操作进行还原是否ok 呢? 也就是用bbed来恢复这7个delete操作。
由于是cluster table 的block,操作相对麻烦一些。不过我尝试修改之后,最后发现错误仍然一样。
其中[kdoirp-3]是什么含义呢? 我们来看下Oracle 文档的描述:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
Layer 11: KCOCODRW - Row opcode 1 : KDOIUR - interpret undo redo opcode 2 : KDOIRP - insert row piece opcode 3 : KDODRP - drop row piece opcode 4 : KDOLKR - lock row piece opcode 5 : KDOURP - update row piece opcode 6 : KDOORP - overwrite row piece opcode 7 : KDOMFC - manipulate first column opcode 8 : KDOCFA - change forwarding address opcode 9 : KDOCKI - change cluster key index opcode 10 : KDOSKL - set key links opcode 11 : KDOQMI - quick multi-insert (ex. insert as select...) opcode 12 : KDOQMD - quick multi-delete opcode 13 : KDOTBF - toggle block header flags |
很明显,这表示insert row piece。 看来我们单纯的修改这2个block 并不能绕过这个问题。 实际上后面我dump分析发现又涉及到_next_object,又将问题复杂化了。
虽然我相信多折腾几次可以解决这个问题。但是操作确实麻烦,费劲。不过此时通过之前的备份restore出来的system文件已经ok了。这里我用bbed 将涉及到的几个block 进行替换,最后再修改resetlogs信息,重建控制文件之后,进行recover。
非常顺利的打开了数据库。最后检查alert log 还涉及到smon 回滚某个事务失败。那么如何完美处理呢?
首先dump undo header,然后获取该事务涉及的操作对象,然后使用如下参数屏蔽回滚段后,将undo表空间重建即可。
如下是dump undo header获取的信息:
针对这部分对象,由于破坏了事务的完整性,那么建议对表进行分析,其中Index进行重建。
Leave a Reply
You must be logged in to post a comment.