存储Cache 丢失导致数据库无法open的案例
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 存储Cache 丢失导致数据库无法open的案例
最近某客户的一套核心数据库由于存储问题导致清掉Cache之后无法启动。首先我们来看看数据库在启动的时候报什么错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
Thu Sep 21 19:35:55 2017 WARNING: Write Failed. group:1 disk:3 AU:53436 offset:95744 size:1024 Errors in file /u01/app/oracle/diag/rdbms/ods/xxx2/trace/xxx2_lgwr_14636.trc: ORA-15080: synchronous I/O operation to a disk failed WARNING: failed to write mirror side 1 of virtual extent 43 logical extent 0 of file 265 in group 1 on disk 3 allocation unit 53436 Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx2/trace/xxx2_lgwr_14636.trc: ORA-00345: redo log write error block 88251 count 2 ORA-00312: online log 3 thread 2: '+DATA/xxx/onlinelog/group_3.265.816035881' ORA-15081: failed to submit an I/O operation to a disk ORA-15081: failed to submit an I/O operation to a disk Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx2/trace/xxx2_lgwr_14636.trc: ORA-00340: IO error processing online log 3 of thread 2 ORA-00345: redo log write error block 88251 count 2 ORA-00312: online log 3 thread 2: '+DATA/xxx/onlinelog/group_3.265.816035881' ORA-15081: failed to submit an I/O operation to a disk ORA-15081: failed to submit an I/O operation to a disk LGWR (ospid: 14636): terminating the instance due to error 340 |
错误并不复杂。可以看到Oracle这里已经无法正常写Redo logfile了。由于这套数据库是非归档,只有逻辑备份,因此即使恢复成功也面临数据丢失的可能性。首先我在尝试进行恢复时,发现居然无法mount数据库,在mount过程中实例被直接终止了,感觉非常奇怪。也没有报非常明显的错误。mount过程出错,那么无疑是controlfile存在异常;由于没有controlfile备份,因此这里先手工重建控制文件,如下是脚本:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
CREATE CONTROLFILE REUSE DATABASE "XXX" RESETLOGS NOARCHIVELOG MAXLOGFILES 16 MAXLOGMEMBERS 3 MAXDATAFILES 100 MAXINSTANCES 8 MAXLOGHISTORY 584 LOGFILE GROUP 1 '+data/ods/ONLINELOG/group_1.257.816033845' SIZE 500M BLOCKSIZE 512, GROUP 2 '+data/xxx/ONLINELOG/group_2.258.816033845' SIZE 500M BLOCKSIZE 512, GROUP 3 '+data/xxx/ONLINELOG/group_3.265.816035881' SIZE 500M BLOCKSIZE 512, GROUP 4 '+data/xxx/ONLINELOG/group_4.266.816035883' SIZE 500M BLOCKSIZE 512, GROUP 5 '+data/xxx/ONLINELOG/group_5.275.816036347' SIZE 500M BLOCKSIZE 512, GROUP 6 '+data/xxx/ONLINELOG/group_6.277.816036359' SIZE 500M BLOCKSIZE 512 DATAFILE '+DATA/xxx/datafile/system.259.816033847', '+DATA/xxx/datafile/sysaux.260.816033849', '+DATA/xxx/datafile/undotbs1.261.816033851', '+DATA/xxx/datafile/undotbs2.263.816033859', '+DATA/xxx/datafile/users.264.816033859', '+DATA/xxx/datafile/tbs_tbdata.278.816036381', '+DATA/xxx/datafile/tbs_omdata.283.816036779', '+DATA/xxx/datafile/tbs_cmdata.284.816036813', '+DATA/xxx/datafile/tbs_dmdata.285.816036857', '+DATA/xxx/datafile/tbs_dbetl.286.816036893', '+DATA/xxx/datafile/tbs_schedule.287.816036909', '+DATA/xxx/datafile/tbs_meast.288.816036915', '+DATA/xxx/datafile/tbs_m1104.289.816036939', '+DATA/xxx/datafile/tbs_mpisa.293.842192725', '+DATA/xxx/datafile/tbs_mpfsc', '+DATA/xxx/datafile/tbs_msafe', '+DATA/xxx/datafile/tbs_mecsp', '+DATA/xxx/datafile/tbs_mpbss', '+DATA/xxx/datafile/tbs_mpbfc', '+DATA/xxx/datafile/idx_cmdata' CHARACTER SET ZHS16GBK; |
重建完毕后。其实这里我首先尝试了进行noresetlogs创建,但是发现报错:
1 2 |
ORA-00600: internal error code, arguments: [2762], [4294967295], [1024000], [+DATA/xxx/onlinelog/group_3.265.816035881], [], [], [], [], [], [], [], [] |
很明显,Redo logfile有问题;看来还是只能Resetlogs方式创建。创建完毕之后,尝试进行了recover database using backup controlfile until cancel恢复操作;然后通过隐含参数强制open发现还是有如下错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
Fri Sep 22 13:00:10 2017 SMON: enabling cache recovery Instance recovery: looking for dead threads Instance recovery: lock domain invalid but no dead threads Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc (incident=120288): ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_120288/xxx1_ora_1593_i120288.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc: ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], [] Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc: ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], [] Error 600 happened during db open, shutting down database USER (ospid: 1593): terminating the instance due to error 600 Instance terminated by USER, pid = 1593 ORA-1092 signalled during: alter database open resetlogs... |
这是非常经典的错误了;由于这是scn的问题;而且数据库版本为11.2.0.3.0,未安装任何psu。因此这里是可以直接推进scn的。
直接通过10015 event来推进数据库的scn;另外由于是异常关机,那么这里Undo必然也无法进行正常恢复;因此同时设置undo_management参数为manual,并同时设置10015 event:
alter session set events ‘10015 trace name adjust_scn level 2’;
顺利打开了数据库。打开数据库之后立刻重建数据库Undo和temp,如下:
1 2 3 4 5 6 7 |
create undo tablespace undo1 datafile '+data' size 2048m; create undo tablespace undo2 datafile '+data' size 2048m; drop tablespace undotbs1 including contents and datafiles; drop tablespace undotbs2 including contents and datafiles; alter tablespace temp add tempfile '+data/rep/tempfile/TEMP.276.816036349' reuse; alter tablespace temp add tempfile '+data/ods/tempfile/temp1607202' reuse; alter tablespace temp add tempfile '+data/ods/tempfile/temp1607203' reuse; |
再次重启数据库之后,发现alert log仍然有一些错误;这是在所难免的。如下所示:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
ORA-00604: error occurred at recursive SQL level 1 ORA-08102: index key not found, obj# 290, file 1, block 1029 (2) ORA-12012: error on auto execute of job 4001 ORA-08102: index key not found, obj# 290, file 1, block 1029 (2) Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_j003_8160.trc: ORA-00604: error occurred at recursive SQL level 1 ORA-08102: index key not found, obj# 290, file 1, block 1029 (2) ORA-12012: error on auto execute of job 4002 ORA-08102: index key not found, obj# 290, file 1, block 1029 (2) Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_8043.trc: Fri Sep 22 13:30:40 2017 Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_8043.trc: Fri Sep 22 13:30:42 2017 Dumping diagnostic data in directory=[cdmp_20170922133042], requested by (instance=1, osid=8043), summary=[abnormal process termination]. Fri Sep 22 13:31:59 2017 Starting background process SMCO Fri Sep 22 13:31:59 2017 SMCO started with pid=35, OS id=9375 Fri Sep 22 13:37:54 2017 Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc (incident=144379): ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_144379/xxx1_m000_10623_i144379.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Sep 22 13:37:55 2017 Dumping diagnostic data in directory=[cdmp_20170922133755], requested by (instance=1, osid=10623 (M000)), summary=[incident=144379]. Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc (incident=144380): ORA-00600: internal error code, arguments: [kewrose_1], [600], [ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] ], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_144380/xxx1_m000_10623_i144380.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc: ORA-00600: internal error code, arguments: [kewrose_1], [600], [ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], [] ], [], [], [], [], [], [], [], [], [] Dumping diagnostic data in directory=[cdmp_20170922133757], requested by (instance=1, osid=10623 (M000)), summary=[incident=144380]. Fri Sep 22 13:37:58 2017 Sweep [inc][144380]: completed |
实际上当时在进行恢复时,我手工处理掉了obj# 290;但是进一步检查发现obj$,col_usage$ ,i_obj4# 都存在问题。而且不一致的记录还比较多:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 |
select /*+ index(t i_obj4) */ DATAOBJ#,type#,owner# from obj$ t minus select /*+ full(t1) */ DATAOBJ#,type#,owner# from obj$ t1; DATAOBJ# TYPE# OWNER# ---------- ---------- ---------- 1451154 2 90 1589557 1 92 1589558 2 92 1589573 2 100 1589574 2 100 1589575 2 100 1589576 2 100 1589577 2 100 1589578 2 100 1589579 2 100 1589580 2 100 1589581 2 100 1589582 2 100 1589583 2 100 1589584 2 100 1589585 2 100 1589586 2 100 1589587 2 100 1589588 2 100 1589589 2 100 1589590 2 100 1589591 2 100 1589592 2 100 1589593 2 100 1589594 2 100 1589595 2 100 1589596 2 100 1589597 2 100 1589598 2 100 1589599 2 100 1589600 2 100 1589601 2 100 1589602 2 100 1589603 2 100 1589604 2 100 1589607 2 100 1589612 0 0 select /*+ full(t1) */ DATAOBJ#,type#,owner# from obj$ t1 minus select /*+ index(t i_obj4) */ DATAOBJ#,type#,owner# from obj$ t ; DATAOBJ# TYPE# OWNER# ---------- ---------- ---------- 1587659 2 100 1587660 2 100 1587661 2 100 1587662 2 100 1587663 2 100 1587664 2 100 1587665 2 100 1587666 2 100 1587667 2 100 1587668 2 100 1587669 2 100 1587670 2 100 1587671 2 100 1587672 2 100 1587673 2 100 1587674 2 100 1587675 2 100 1587676 2 100 1587677 2 100 1587678 2 100 1587679 2 100 1587680 2 100 1587681 2 100 1587682 2 100 1587683 2 100 1587684 2 100 1587685 2 100 1587687 2 100 1587688 2 100 1587689 2 100 1587690 2 100 1587691 2 100 1587692 2 100 1587695 1 92 1587696 2 92 1587716 2 90 1587717 2 90 1587718 2 90 1587719 2 90 1587720 2 90 1587721 2 90 1587722 2 90 1587723 2 90 1587724 2 90 1587725 2 90 1587726 2 90 1587727 2 90 1587728 2 90 1587729 2 90 1587730 2 90 1587732 2 90 1587733 2 90 1589527 0 0 |
最开始我还尝试通过bbed修复了2个Block;最后发现依然难以处理这个ora-08102错误;后续通过上述sql比较发现居然有如此多的记录不一致。修改起来太过麻烦了。
这里其实本来想尝试通过重建obj$,i_obj4$,col_usage$ 来解决的。但是担心有较大的风险,因此这里建议可以进行了数据库重建。由于obj$这里有问题,expdp操作都报错,无法执行任何ddl操作。因此最好通过exp拆分脚本来进行重建处理。整个数据库恢复➕重建过程将近20小时左右(2tb左右的库).
由于客户存储环境io较差,因此导致整个重建过程比较复杂,比较耗时。我们在开玩笑讲到,如果可能的数据库运行在我们的Zdata环境上,那么数据库重建过程在2小时内即可完成;而且也不会出现类似故障;因此Zdata的io操作上直接落盘或者写到Pcie上;不存在数据丢失的风险。
最后补充一点:
1) 由于数据库很多事务无法正常恢复,导致SMON在不断尝试进行事务恢复时报错,达到一定次数之后会crash实例;进而影响数据库的重建工作。可通过设置_smon_internal_errlimit 参数来避免该问题。
2) 为了加快exp和imp速度,这里我们利用了管道技术,脚本如下:
1 2 3 |
mknod /backup/omdata10 exp \'/ as sysdba\' parfile=omdata.par file=/backup/omdata10 rows=y indexes=n compress=n direct=y recordlength=65536 buffer=52488000 feedback=100000 volsize=0 log=omdata_other.log; imp \'/ as sysdba\' file=/backup/omdata10 fromuser=omdata touser=omdata buffer=52488000 |
Leave a Reply
You must be logged in to post a comment.