某客户超融合硬件损坏导致Oracle RAC异常的恢复实录
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 某客户超融合硬件损坏导致Oracle RAC异常的恢复实录
前几天某客户遇到一个棘手问题;其一套Oracle rac环境运行在HW超融合环境中;由于硬件问题导致数据库crash;期间出现了不少数据坏块;不过还好客户有RMAN物理备份;因此客户在找到我之前就已经先进行了全库Restore;首先我们来看下相关日志信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
Trying mirror side DATADG_0002. Reread of blocknum=145005, file=+DATADG/xxxxx/datafile/supervise. found same corrupt data Reread of blocknum=145005, file=+DATADG/xxxxx/datafile/supervise. found valid data Sun Apr 26 01:22:35 2020 Hex dump of (file 11, block 230454) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_13254.trc Corrupt block relative dba: 0x02c38436 (file 11, block 230454) Bad header found during backing up datafile Data in bad block: type: 6 format: 2 rdba: 0x0082f736 last change scn: 0x0000.4b5a95d7 seq: 0x1 flg: 0x06 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0x95d70601 check value in block header: 0xa3d1 computed block checksum: 0x0 Trying mirror side DATADG_0002. Reread of blocknum=230454, file=+DATADG/xxxxx/datafile/supdb_etyy.dbf. found same corrupt data Reread of blocknum=230454, file=+DATADG/xxxxx/datafile/supdb_etyy.dbf. found valid data Hex dump of (file 11, block 230458) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_13254.trc Corrupt block relative dba: 0x02c3843a (file 11, block 230458) |
上述类似的大量坏块信息,最终导致数据库挂掉;其中我们可以看到客户这里asm diskgroup为normal冗余;当primary extent数据有问题时,Oracle会尝试从mirror extent去获取;如果mirror extent是正常的;那么Oracle会自动进行修复;否则会导致数据丢失;严重的话会导致数据库宕机。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Mon Apr 27 02:01:15 2020 Read of datafile '+DATADG/orclsk/datafile/sysaux.260.956077231' (fno 2) header failed with ORA-01207 Rereading datafile 2 header from mirror side 'DATADG_0000' failed with ORA-01207 Rereading datafile 2 header from mirror side 'DATADG_0006' failed with ORA-01207 Errors in file /u01/app/oracle/diag/rdbms/orclsk/orclsk1/trace/orclsk1_ckpt_4893.trc: ORA-63999: data file suffered media failure ORA-01122: database file 2 failed verification check ORA-01110: data file 2: '+DATADG/orclsk/datafile/sysaux.260.956077231' ORA-01207: file is more recent than control file - old control file Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ckpt_4893.trc: ORA-63999: data file suffered media failure ORA-01122: database file 2 failed verification check ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231' ORA-01207: file is more recent than control file - old control file CKPT (ospid: 4893): terminating the instance due to error 63999 |
最因为部分文件需要介质恢复(因为primary和mirror 数据都异常);而数据库强行终止了实例。数据库实例重启后则开始出现大量错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 |
Mon Apr 27 02:02:03 2020 Read of datafile '+DATADG/xxxxx/datafile/mscsz.dbf' (fno 40) header failed with ORA-01207 Rereading datafile 40 header from mirror side 'DATADG_0000' failed with ORA-01207 Rereading datafile 40 header from mirror side 'DATADG_0005' failed with ORA-01207 Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_dbw0_115709.trc: ORA-01186: file 40 failed verification tests ORA-01122: database file 40 failed verification check ORA-01110: data file 40: '+DATADG/xxxxx/datafile/mscsz.dbf' ORA-01207: file is more recent than control file - old control file File 40 not verified due to error ORA-01122 ....... Mon Apr 27 02:02:07 2020 QMNC started with pid=46, OS id=115887 Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_115771.trc (incident=299881): ORA-01578: ORACLE data block corrupted (file # 1, block # 3620) ORA-01110: data file 1: '+DATADG/xxxxx/datafile/system.259.956077225' Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_299881/xxxxx1_ora_115771_i299881.trc Mon Apr 27 02:02:16 2020 Dumping diagnostic data in directory=[cdmp_20200427020216], requested by (instance=1, osid=115771), summary=[incident=299881]. Mon Apr 27 02:02:17 2020 Sweep [inc][299881]: completed Mon Apr 27 02:02:18 2020 Completed: ALTER DATABASE OPEN /* db agent *//* {0:13:1899} */ Mon Apr 27 02:02:18 2020 ORA-21700 encountered when generating server alert SMG-4120 ORA-21700 encountered when generating server alert SMG-4121 ORA-21700 encountered when generating server alert SMG-4121 Mon Apr 27 02:02:20 2020 Starting background process CJQ0 Mon Apr 27 02:02:20 2020 CJQ0 started with pid=54, OS id=115990 Mon Apr 27 02:02:20 2020 Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc (incident=300025): ORA-01578: ORACLE data block corrupted (file # 2, block # 101561) ORA-01110: data file 2: '+DATADG/xxxxx/datafile/sysaux.260.956077231' Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300025/xxxxx1_m004_115984_i300025.trc Dumping diagnostic data in directory=[cdmp_20200427020222], requested by (instance=1, osid=115984 (M004)), summary=[incident=300025]. Mon Apr 27 02:02:22 2020 Sweep [inc][300025]: completed Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc (incident=300026): ORA-01578: ORACLE data block corrupted (file # , block # ) Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300026/xxxxx1_m004_115984_i300026.trc Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_m004_115984.trc (incident=300027): ORA-01578: ORACLE data block corrupted (file # , block # ) ORA-06512: at "SYS.PRVT_ADVISOR", line 6144 ORA-06512: at line 1 Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_300027/xxxxx1_m004_115984_i300027.trc Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300026]. Dumping diagnostic data in directory=[cdmp_20200427020224], requested by (instance=1, osid=115984 (M004)), summary=[incident=300027]. Mon Apr 27 02:02:46 2020 Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc Corrupt block relative dba: 0x0040613f (file 1, block 24895) Bad header found during buffer read Data in bad block: type: 6 format: 2 rdba: 0x00405c3f last change scn: 0x0000.0002a412 seq: 0x1 flg: 0x06 spare1: 0x0 spare2: 0x0 spare3: 0x0 consistency value in tail: 0xa4120601 check value in block header: 0xa486 computed block checksum: 0x0 Reading datafile '+DATADG/xxxxx/datafile/system.259.956077225' for corruption at rdba: 0x0040613f (file 1, block 24895) Read datafile mirror 'DATADG_0001' (file 1, block 24895) found same corrupt data (no logical check) Read datafile mirror 'DATADG_0003' (file 1, block 24895) found valid data Hex dump of (file 1, block 24895) in trace file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_ora_116134.trc Repaired corruption at (file 1, block 24895) |
不难看出数据库控制文件和system都出现了异常;这也难怪数据库无法正常打开。期间还出现了ora-00600 [4193]等常见错误:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
Mon Apr 27 02:55:26 2020 QMNC started with pid=44, OS id=3072 Mon Apr 27 02:55:26 2020 Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_mmon_2939.trc (incident=323867): ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_323867/xxxxx1_mmon_2939_i323867.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Completed: ALTER DATABASE OPEN /* db agent *//* {0:17:166} */ ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Mon Apr 27 02:55:29 2020 Dumping diagnostic data in directory=[cdmp_20200427025529], requested by (instance=1, osid=2939 (MMON)), summary=[incident=323867]. Block recovery from logseq 17833, block 46 to scn 1551059886 Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0 Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213 Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217 Block recovery stopped at EOT rba 17833.54.16 Block recovery completed at rba 17833.54.16, scn 0.1551059805 Block recovery from logseq 17833, block 46 to scn 1551059575 Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0 Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213 Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217 Block recovery completed at rba 17833.46.16, scn 0.1551059664 Errors in file /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/trace/xxxxx1_mmon_2939.trc (incident=323868): ORA-00600: internal error code, arguments: [4193], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/xxxxx/xxxxx1/incident/incdir_323868/xxxxx1_mmon_2939_i323868.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Dumping diagnostic data in directory=[cdmp_20200427025534], requested by (instance=1, osid=2939 (MMON)), summary=[incident=323868]. Block recovery from logseq 17833, block 46 to scn 1551059886 Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0 Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213 Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217 Block recovery completed at rba 17833.54.16, scn 0.1551059888 Block recovery from logseq 17833, block 46 to scn 1551059888 Recovery of Online Redo Log: Thread 1 Group 1 Seq 17833 Reading mem 0 Mem# 0: +DATADG/xxxxx/onlinelog/group_1.257.956077213 Mem# 1: +ARCHDG/xxxxx/onlinelog/group_1.257.956077217 Block recovery completed at rba 17833.56.16, scn 0.1551059889 Mon Apr 27 02:55:36 2020 。。。。。。 |
对于此次恢复case总体来讲是比较简单等;这里为提供一下处理思路:
1、首先利用客户的归档和Redo(Redo log客户已copy到了本地进行备份);进行正常的recover database using backup controlfile until cancel操作;
2、由于恢复过程中出现了ora-00600 [3020]错误,因此需要通过recover database allow xx corruption的方式进行;
3、完成恢复之后尝试打开数据库;
4、打开数据库时仍然提示ora-01113和ora-01110错误,即system文件还需要进行恢复;这种情况下只能先强制拉库;通过加入*._allow_resetlogs_corruption=TRUE *._allow_error_simulation=true 等隐含参数即可;
5、加入上述参数后仍然提示undo 存在坏块;由于该数据库版本在报错时不会直接提示具体是哪个回滚段有问题;因此在alter database open resetlogs之前通过10046 event定位error cursor;找到回滚段名称即可;
6、使用_corrupted_rollback_segments参数屏蔽回滚段;
7、open resetlogs数据库成功
8、最后就是重建undo以及处理相关坏块等善后工作。
恢复完成之后,由于客户担心HW超融合环境再次出现故障因此进行了全库备份并进行数据迁移到新平台;至此这个简单的case告一段落。
备注:话说这次客户貌似2年前我帮忙恢复过一次数据库,也是Oracle ASM;没想到2年后一大清早又找来了。再次叮嘱大家,注意数据库备份、注意数据容灾环境建设!
Leave a Reply
You must be logged in to post a comment.