记一次TB级别的Exadata数据库恢复
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 记一次TB级别的Exadata数据库恢复
某客户的exadata环境,由于大量硬件故障,导致数据库宕机。经过分析确认是由于某个cell节点的某个disk出现异常,导致部分
cell节点的disk 离线后无法进行asm reblance;最终导致diskgroup 被强制dismount了。如下是alert log:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
Fri Oct 09 15:44:26 2015 NOTE: process _user4655_+asm1 (4655) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1 NOTE: checking PST: grp = 1 ERROR: Disk 18 cannot be offlined, since all the disks [18, 49] with mirrored data would be offline. ERROR: too many offline disks in PST (grp 2) GMON checking disk modes for group 1 at 55 for pid 41, osid 4655 Fri Oct 09 15:44:26 2015 NOTE: cache dismounting (not clean) group 2/0x4AD8D253 (DBFS_DG) ERROR: too many offline disks in PST (grp 1) NOTE: checking PST for grp 1 done. WARNING: Disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1 NOTE: initiating PST update: grp = 1, dsk = 46/0xe968239f, mode = 0x6a, op = 4 GMON updating disk modes for group 1 at 56 for pid 41, osid 4655 ERROR: Disk 46 cannot be offlined, since all the disks [46, 72] with mirrored data would be offline. ERROR: too many offline disks in PST (grp 1) WARNING: Offline of disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 and mode 0x7f failed on ASM inst 1 WARNING: Disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 mode 0x7f offline is being aborted Fri Oct 09 15:44:26 2015 NOTE: process _user4663_+asm1 (4663) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1 NOTE: checking PST: grp = 1 WARNING: Offline of disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 and mode 0x7f failed on ASM inst 1 Fri Oct 09 15:44:26 2015 NOTE: halting all I/Os to diskgroup 2 (DBFS_DG) NOTE: unable to offline disks after getting write error for diskgroup DBFS_DG NOTE: write to disk 32 succeeded NOTE: disk 18 had IO error NOTE: write to disk 6 succeeded NOTE: halting all I/Os to diskgroup 1 (DATA_KLYX) NOTE: LGWR doing non-clean dismount of group 1 (DATA_KLYX) NOTE: LGWR sync ABA=21.5465 last written ABA 21.5465 GMON checking disk modes for group 1 at 57 for pid 44, osid 4663 ERROR: too many offline disks in PST (grp 1) NOTE: checking PST for grp 1 done. |
我们可以看到,cell 2节点的第10号disk存在异常,通过cellcli命令行也可以进一步确认:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
CellCLI> list physicaldisk; 20:0 E07Q5N normal 20:1 E05SMD normal 20:2 E07GV6 normal 20:3 E060YD normal 20:4 E05ZQE normal 20:5 E07L9M normal 20:6 E07FED normal 20:7 E05N88 normal 20:8 E08DY5 normal 20:9 E06ADQ normal 20:10 E08E5C critical 20:11 E08DYF normal |
通过对该磁盘进行修复检查,确实也发现了坏道,如下所示:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
SEAGATE ST360057SSUN600G --> SEAGATE ST360057SSUN600G Sector 0 --> Sector 0 1,172,123,568 Sectors Sectors that could not be read: 102,578,137 102,578,139 102,578,140 102,578,141 102,578,142 102,578,149 102,578,152 ...... 1,172,123,561 sector(s) successfully copied. 7 bad source sectors encountered. Corresponding destination sectors filled with: UNREADABLESECTOR |
修复完毕后,成功mount diskgroup。但是最后检查数据库存在部分数据坏块,如下所示:
1 2 3 4 5 6 7 8 9 10 |
SQL> select * from V$DATABASE_BLOCK_CORRUPTION; FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE ---------- ---------- ---------- ------------------ ----------------- 1 5186816 3 0 CORRUPT 1 5186813 1 0 CORRUPT 1 3343734 2 0 CORRUPT 25 2391372 1 1.4561E+13 CORRUPT 35 538916 1 0 CORRUPT 36 543804 3 0 CORRUPT |
对于数据坏块的修复相对比较简单,只是由于是asm环境处理起来相对麻烦一些,加上客户没有备份,所以无法进行blockcover。对于system的几个数据坏块,经过检查发现为审计对象,因此通过truncate然后insert数据即可格式化掉;
另外第25号文件的数据坏块为逻辑坏块,通过查询确认为index,直接rebuild index即可。
至于说最后第第35,36号文件,由于均为数据文件,涉及的对象均为分区表的partiiton,因此处理相对麻烦一些。
这里我通过创建测试表,然后清理掉数据,构造一个数据空块,然后将该数据块复制到asm diskgroup中替换到这几个坏块即可,注意替换之前,需要修改数据坏块的rdba地址和obj id信息,如下是数据块的复制替换过程:
1 2 3 4 5 6 7 8 9 10 11 |
SQL> @copy_block_to_asm.sql Enter value for file_with_patched_block: /u01/.../blk.dbf old 15: v_FsFileName := '&file_with_patched_block'; new 15: v_FsFileName := '/u01/.../blk.dbf'; Enter value for file_to_patch_in_asm: +DATA_KLYX/xxxx1/datafile/epmrange107.dbf old 16: v_AsmFileName := '&file_to_patch_in_ASM'; new 16: v_AsmFileName := '+DATA_KLYX/xxxx1/datafile/epmrange107.dbf'; Enter value for block_to_patch: 543804 old 17: v_offstart := '&block_to_patch'; new 17: v_offstart := '543804'; PL/SQL procedure successfully completed. |
最后通过rman检查,确认一切正常。如下是rman的检测结果:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
....... RMAN> backup validate datafile 36; Starting backup at 16-OCT-15 using target database control file instead of recovery catalog allocated channel: ORA_DISK_1 channel ORA_DISK_1: SID=1337 instance=xxx1 device type=DISK channel ORA_DISK_1: starting full datafile backup set channel ORA_DISK_1: specifying datafile(s) in backup set input datafile file number=00036 name=+DATA_KLYX/xxx1/datafile/epmrange107.dbf channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15 List of Datafiles ================= File Status Marked Corrupt Empty Blocks Blocks Examined High SCN ---- ------ -------------- ------------ --------------- ---------- 36 OK 0 315258 3276800 14561048845049 File Name: +DATA_KLYX/xxx1/datafile/epmrange107.dbf Block Type Blocks Failing Blocks Processed ---------- -------------- ---------------- Data 0 2926354 Index 0 20 Other 0 35168 ...... |
Leave a Reply
You must be logged in to post a comment.