数套ASM RAC的恢复案例
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 数套ASM RAC的恢复案例
前不久帮助某客户恢复了6套Oracle RAC,均为ASM,而且版本均为10.2.0.4。熬夜好几天,差点吐血了。
这里以其中一套库的恢复进行简单说明,跟大家分享。
其中几套基本上都遇到了如下的ORA-00600 错误:
1 2 3 4 5 6 7 |
Thu Dec 31 11:55:46 2015 SUCCESS: diskgroup DG1 was mounted Thu Dec 31 11:55:50 2015 Errors in file /oracle/admin/xxx/udump/xxx1_ora_28803.trc: ORA-00600: internal error code, arguments: [kccpb_sanity_check_2], [13715626], [13715623], [0x000000000], [], [], [], [] SUCCESS: diskgroup DG1 was dismounted Thu Dec 31 11:55:51 2015 |
对于该错误,其实很简单,主要是因为控制文件损坏,通过重建控制文件或者利用备份的控制文件进行restore即可进行mount;甚至于我们利用控制文件快照都可以进行数据库mount;然后接着进行恢复操作。在恢复的过程中还遇到了如下的错误:
1 2 3 4 5 6 7 8 9 10 11 12 |
Errors in file /oracle/admin/xxx/udump/xxx1_ora_6990.trc: ORA-00600: internal error code, arguments: [kclchkblk_4], [3431], [18446744072948603858], [3431], [18446744072948586897], [], [], [] Tue Jan 5 10:52:28 2016 Errors in file /oracle/admin/xxx/bdump/xxx1_arc0_8205.trc: ORA-19504: failed to create file "+DG1/xxx/archivelog/1_2_900069464.dbf" ORA-17502: ksfdcre:4 Failed to create file +DG1/xxx/archivelog/1_2_900069464.dbf ORA-00600: internal error code, arguments: [kffbAddBlk04], [], [], [], [], [], [], [] Tue Jan 5 10:52:28 2016 ARC0: Error 19504 Creating archive log file to '+DG1/xxx/archivelog/1_2_900069464.dbf' ARCH: Archival stopped, error occurred. Will continue retrying Tue Jan 5 10:52:30 2016 ORACLE Instance xxx1 - Arc |
上述的ORA-00600错误其实很简单,主要是数据块SCN的问题。这里以其中一套库的恢复进行大致说明,因为在恢复该库的过程中,遇到了一件十分神奇的事情。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
SQL> startup mount pfile='/tmp/p.ora'; ORACLE instance started. Total System Global Area 2.1475E+10 bytes Fixed Size 2122368 bytes Variable Size 2399145344 bytes Database Buffers 1.9059E+10 bytes Redo Buffers 14651392 bytes Database mounted. SQL> ALTER DATABASE ADD LOGFILE THREAD 2 2 GROUP 3 ( 3 '+DG/xxxx/onlinelog/group_3.271.752099989', 4 '+DG/xxxx/onlinelog/group_3.272.752099991' 5 ) SIZE 100M REUSE, 6 GROUP 4 ( 7 '+DG/xxxx/onlinelog/group_4.273.752099991', 8 '+DG/xxxx/onlinelog/group_4.274.752099993' 9 ) SIZE 100M REUSE, 10 GROUP 6 ( 11 '+DG/xxxx/onlinelog/group_6.275.752099993', 12 '+DG/xxxx/onlinelog/group_6.276.752099993' 13 ) SIZE 100M REUSE; ALTER DATABASE ADD LOGFILE THREAD 2 * ERROR at line 1: ORA-01276: Cannot add file +DG/xxxx/onlinelog/group_3.271.752099989. File has an Oracle Managed Files file name. |
由于是ORACLE RAC,因此重建控制文件之后,是需要添加redo logfile的;然而add logfile 发现报上述错误。根据Oracle metalink的一些方法均不能成功,都报上面的错误,确实很怪异。
有些人看上述的错误,可能会认为是设置了OMF的参数,其实这里并不是,我将相关参数全部修改之后,错误依旧。
这里实际上添加logfile时,只写磁盘组名称就行了,不需要写绝对路径。
接着在进行recover后进行open resetlogs打开时,报错ORA-01248,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
SQL> startup mount pfile='/tmp/p.ora'; ORACLE instance started. Total System Global Area 2.1475E+10 bytes Fixed Size 2122368 bytes Variable Size 2399145344 bytes Database Buffers 1.9059E+10 bytes Redo Buffers 14651392 bytes Database mounted. SQL> recover database using backup controlfile until cancel; ORA-00279: change 13300428179625 generated at 04/04/2013 12:51:35 needed for thread 1 ORA-00289: suggestion : +DG/archivelog/arch1_752099890_12809_1.log ORA-00280: change 13300428179625 for thread 1 is in sequence #12809 Specify log: {<RET>=suggested | filename | AUTO | CANCEL} auto ORA-00308: cannot open archived log '+DG/archivelog/arch1_752099890_12809_1.log' ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_752099890_12809_1.log ORA-15173: entry 'arch1_752099890_12809_1.log' does not exist in directory 'archivelog' ORA-00308: cannot open archived log '+DG/archivelog/arch1_752099890_12809_1.log' ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_752099890_12809_1.log ORA-15173: entry 'arch1_752099890_12809_1.log' does not exist in directory 'archivelog' ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below ORA-01194: file 1 needs more recovery to be consistent ORA-01110: data file 1: '+DG/xxxx/datafile/system.256.752099833' SQL> alter database open resetlogs; alter database open resetlogs * ERROR at line 1: ORA-01248: file 42 was created in the future of incomplete recovery ORA-01110: data file 42: '+DG/xxxx/datafile/file_tab_xdidx03.ora' |
这个错误还是比较少见,实际上网上那些说法,以及Oracle mos提供的解决方法我发现都不行。
无奈只能先将其offline ,然后再进行恢复。再进行open之前我查询了当前的checkpoint scn如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
SQL> select file#,checkpoint_change# from v$datafile; FILE# CHECKPOINT_CHANGE# ---------- ----------------------- 1 14731601024328 2 14731601024328 3 14731601024328 4 13300428179625 5 14731601024328 6 14731601024328 7 14731601024328 ....... 39 14731601024328 40 14731601024328 41 14731601024328 42 14731601024328 43 14731601024328 43 rows selected. SQL> c/datafile/datafile_header 1* select file#,checkpoint_change# from v$datafile_header SQL> / FILE# CHECKPOINT_CHANGE# ---------- ----------------------- 1 14731601024328 2 14731601024328 3 14731601024328 4 13300428179625 5 14731601024328 6 14731601024328 7 14731601024328 ...... 40 14731601024328 41 14731601024328 42 14731601024328 43 14731601024328 43 rows selected. |
由于open失败,这里我想着是不是这2个文件有问题,又用之前的快照控制文件进行recover一把,然后再次用重建的控制文件起来数据库进行recover,发现神奇的事情出现了:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
SQL> recover database using backup controlfile until cancel; ORA-00279: change 13305808683011 generated at 01/11/2016 21:09:02 needed for thread 1 ORA-00289: suggestion : +DG/archivelog/arch1_900882531_1_1.log ORA-00280: change 13305808683011 for thread 1 is in sequence #1 Specify log: {<RET>=suggested | filename | AUTO | CANCEL} auto ORA-00308: cannot open archived log '+DG/archivelog/arch1_900882531_1_1.log' ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_900882531_1_1.log ORA-15173: entry 'arch1_900882531_1_1.log' does not exist in directory 'archivelog' ORA-00308: cannot open archived log '+DG/archivelog/arch1_900882531_1_1.log' ORA-17503: ksfdopn:2 Failed to open file +DG/archivelog/arch1_900882531_1_1.log ORA-15173: entry 'arch1_900882531_1_1.log' does not exist in directory 'archivelog' ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below ORA-01194: file 1 needs more recovery to be consistent ORA-01110: data file 1: '+DG/xxxx/datafile/system.256.752099833' SQL> alter database datafile 42 offline; alter database datafile 43 offline; Database altered. SQL> Database altered. SQL> SQL> alter database open resetlogs; alter database open resetlogs * ERROR at line 1: ORA-01092: ORACLE instance terminated. Disconnection forced |
我们可以看到open 失败了,对于open失败的 情况,我们首先是看alert log,接着10046 trace。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
SQL> startup nomount pfile='/tmp/p.ora'; ORACLE instance started. Total System Global Area 2.1475E+10 bytes Fixed Size 2122368 bytes Variable Size 2399145344 bytes Database Buffers 1.9059E+10 bytes Redo Buffers 14651392 bytes SQL> oradebug setmypid Statement processed. SQL> alter database mount; Database altered. SQL> recover database; Media recovery complete. SQL> alter database open; alter database open * ERROR at line 1: ORA-01092: ORACLE instance terminated. Disconnection forced ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00604: error occurred at recursive SQL level 1 ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$" too small |
这里我又屏蔽了undo相关的参数。再次尝试发现错误依旧。再次启动,神奇的事情出现了,SCN居然倒退了?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
SQL> startup mount pfile='/tmp/p.ora'; ORACLE instance started. Total System Global Area 2.1475E+10 bytes Fixed Size 2122368 bytes Variable Size 2399145344 bytes Database Buffers 1.9059E+10 bytes Redo Buffers 14651392 bytes Database mounted. SQL> recover database; Media recovery complete. SQL> select checkpoint_change#,file# from v$datafile; CHECKPOINT_CHANGE# FILE# --------------------------- ---------- 13314398637607 1 13314398637607 2 13314398637607 3 13314398637607 4 13314398637607 5 ...... 13314398637607 38 13314398637607 39 13314398637607 40 13314398637607 41 0 42 0 43 43 rows selected. SQL> select checkpoint_change#,file#,checkpoint_time from v$datafile_header; CHECKPOINT_CHANGE# FILE# CHECKPOIN --------------------------- ---------- --------- 13314398637607 1 11-JAN-16 13314398637607 2 11-JAN-16 13314398637607 3 11-JAN-16 13314398637607 4 11-JAN-16 13314398637607 5 11-JAN-16 ...... 13314398637607 39 11-JAN-16 13314398637607 40 11-JAN-16 13314398637607 41 11-JAN-16 14731601024328 42 30-DEC-15 14731601024328 43 30-DEC-15 43 rows selected. |
很明显,这个133的scn 回退到了过去2年前了,出现时空穿越了。。。。 当然,open肯定还是报错:
1 2 3 4 5 6 7 8 9 10 |
SQL> alter database open; alter database open * ERROR at line 1: ORA-01092: ORACLE instance terminated. Disconnection forced ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00604: error occurred at recursive SQL level 1 ORA-01555: snapshot too old: rollback segment number 4 with name "_SYSSMU4$" too small |
这里先不管为啥连数据文件头的SCN都倒退了(之前被offline的2个文件scn是OK的). 通过10046 trace得到如下内容:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
PARSING IN CURSOR #5 len=52 dep=1 uid=0 oct=3 lid=0 tim=1418474357830663 hv=429618617 ad='5db7ea50' select ctime, mtime, stime from obj$ where obj# = :1 END OF STMT PARSE #5:c=0,e=531,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1418474357830659 BINDS #5: kkscoacd Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0 kxsbbbfp=2ad7172aa020 bln=22 avl=02 flg=05 value=20 EXEC #5:c=1000,e=673,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=1418474357831431 WAIT #5: nam='db file sequential read' ela= 9843 file#=1 block#=218 blocks=1 obj#=-1 tim=1418474357841421 ..... FETCH #6:c=17997,e=64936,p=23,cr=566,cu=0,mis=0,r=1,dep=2,og=3,tim=1418474357907669 STAT #6 id=1 cnt=1 pid=0 pos=1 obj=15 op='TABLE ACCESS BY INDEX ROWID UNDO$ (cr=566 pr=23 pw=0 time=64913 us)' STAT #6 id=2 cnt=1 pid=1 pos=1 obj=34 op='INDEX UNIQUE SCAN I_UNDO1 (cr=1 pr=1 pw=0 time=5769 us)' WAIT #5: nam='db file sequential read' ela= 13031 file#=40 block#=167538 blocks=1 obj#=-1 tim=1418474357920819 FETCH #5:c=19996,e=89548,p=25,cr=568,cu=0,mis=0,r=0,dep=1,og=4,tim=1418474357921006 |
我们这里可以看到,这里报错的SQL读取了file 1 block 218,以及file 40 block 167538。
对于file 1 block 218,我dump 发现没有活动事务;而file 40 block 167538则为undo 块.
1 2 3 4 5 |
SQL> select name from v$datafile where file#=40; NAME -------------------------------------------------------------------------------- +DG/xxxx/datafile/undotbs4 |
同时dump 了这个undo 块,发现确实感觉有些异常,如下所示:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
******************************************************************************** UNDO BLK: xid: 0x0009.01b.0014320a seq: 0xa47 cnt: 0x1 irb: 0x1 icl: 0x0 flg: 0x0000 Rec Offset Rec Offset Rec Offset Rec Offset Rec Offset --------------------------------------------------------------------------- 0x01 0x0014 *----------------------------- * Rec #0x1 slt: 0x1b objn: 55417(0x0000d879) objd: 296039 tblspc: 20(0x00000014) * Layer: 10 (Index) opc: 21 rci 0x00 Undo type: Regular undo Last buffer split: No Temp Object: No Tablespace Undo: No rdba: 0x0a028e71 *----------------------------- index general undo (branch) operations KTB Redo op: 0x05 ver: 0x01 op: R itc: 2 Itl Xid Uba Flag Lck Scn/Fsc 0x01 0x0009.01b.0014320a 0x0a028e71.0a47.04 ---- 1 fsc 0x0000.00000000 0x02 0x0009.02b.001428b6 0x0a028e6f.0a47.06 ---- 112 fsc 0x0000.00000000 Dump kdige : block dba :0x05d630b3, seghdr dba: 0x06076e89 restore block before image |
由于所有的文件头SCN 都倒退了,正常open 都报错,只能推进SCN,而且SCN必须要比这个undo block的最大SCN 还要大一些才行,通过在pfile文件中加入参数*._minimum_giga_scn即可。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
SQL> conn /as sysdba Connected to an idle instance. SQL> startup mount pfile='/tmp/p.ora'; ORACLE instance started. Total System Global Area 2.1475E+10 bytes Fixed Size 2122368 bytes Variable Size 2399145344 bytes Database Buffers 1.9059E+10 bytes Redo Buffers 14651392 bytes Database mounted. SQL> show parameter job NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ job_queue_processes integer 10 SQL> alter system set job_queue_processes=0; System altered. SQL> alter database open; Database altered. SQL> drop tablespace undotbs3 including contents and datafiles; Tablespace dropped. SQL> drop tablespace undotbs4 including contents and datafiles; Tablespace dropped. |
顺利打开数据库之后,立即将原有的undo 表空间进行drop 并重建。
虽然数据库是打开了,然而其中有2个数据文件之前被我们offline了,而且中间进行了resetlogs操作,因此现在无法进行正常online了。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
SQL> select file#,checkpoint_change# from v$datafile; FILE# CHECKPOINT_CHANGE# ---------- ------------------------------ 1 14759124431097 2 14759124431097 3 14759124431097 4 14759124431097 5 14759124431097 ...... 36 14759124431097 37 14759124431097 38 14759124431097 41 14759124431097 42 0 43 0 44 14759124431097 45 14759124431097 43 rows selected. SQL> alter database datafile 42 online; alter database datafile 42 online * ERROR at line 1: ORA-01190: control file or data file 42 is from before the last RESETLOGS ORA-01110: data file 42: '+DG/xxxx/datafile/file_tab_xxx03.ora' SQL> alter database datafile 43 online; alter database datafile 43 online * ERROR at line 1: ORA-01190: control file or data file 43 is from before the last RESETLOGS ORA-01110: data file 43: '+DG/xxxx/datafile/file_tab1_xxx05.ora' |
这里用bbed 将上面2个文件头相关信息修改掉,然后进行recover,可以顺利online文件。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
SQL> recover datafile 42; Media recovery complete. SQL> alter database datafile 42 online; Database altered. SQL> recover datafile 43; Media recovery complete. SQL> alter database datafile 43 online; Database altered. SQL> select file#,checkpoint_change# ,status from v$datafile; FILE# CHECKPOINT_CHANGE# STATUS ---------- ------------------------------ ------- 1 14759124821491 SYSTEM 2 14759124821491 SYSTEM 3 14759124821491 ONLINE 4 14759124821491 ONLINE 5 14759124821491 ONLINE 。。。。。。 36 14759124821491 ONLINE 37 14759124821491 ONLINE 38 14759124821491 ONLINE 41 14759124821491 ONLINE 42 14759124831966 ONLINE 43 14759124832115 ONLINE 44 14759124821491 ONLINE 45 14759124821491 ONLINE 43 rows selected. SQL> alter system checkpoint; System altered. SQL> select file#,checkpoint_change# ,status from v$datafile; FILE# CHECKPOINT_CHANGE# STATUS ---------- ------------------------------ ------- 1 14759124832224 SYSTEM 2 14759124832224 SYSTEM 3 14759124832224 ONLINE 4 14759124832224 ONLINE 5 14759124832224 ONLINE ...... 38 14759124832224 ONLINE 41 14759124832224 ONLINE 42 14759124832224 ONLINE 43 14759124832224 ONLINE 44 14759124832224 ONLINE 45 14759124832224 ONLINE 43 rows selected. |
最后建议将数据库expdp 导出并重建。到此告一段落!
Leave a Reply
You must be logged in to post a comment.