记一次TB级别的Exadata数据库恢复

本站文章除注明转载外，均为本站原创： 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客

某客户的exadata环境，由于大量硬件故障，导致数据库宕机。经过分析确认是由于某个cell节点的某个disk出现异常，导致部分

cell节点的disk 离线后无法进行asm reblance；最终导致diskgroup 被强制dismount了。如下是alert log：

Fri Oct 09 15:44:26 2015
NOTE: process _user4655_+asm1 (4655) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
ERROR: Disk 18 cannot be offlined, since all the disks [18, 49] with mirrored data would be offline.
ERROR: too many offline disks in PST (grp 2)
GMON checking disk modes for group 1 at 55 for pid 41, osid 4655
Fri Oct 09 15:44:26 2015
NOTE: cache dismounting (not clean) group 2/0x4AD8D253 (DBFS_DG)
ERROR: too many offline disks in PST (grp 1)
NOTE: checking PST for grp 1 done.
WARNING: Disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 1, dsk = 46/0xe968239f, mode = 0x6a, op = 4
GMON updating disk modes for group 1 at 56 for pid 41, osid 4655
ERROR: Disk 46 cannot be offlined, since all the disks [46, 72] with mirrored data would be offline.
ERROR: too many offline disks in PST (grp 1)
WARNING: Offline of disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 and mode 0x7f failed on ASM inst 1
WARNING: Disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 mode 0x7f offline is being aborted
Fri Oct 09 15:44:26 2015
NOTE: process _user4663_+asm1 (4663) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1
NOTE: checking PST: grp = 1
WARNING: Offline of disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 and mode 0x7f failed on ASM inst 1
Fri Oct 09 15:44:26 2015
NOTE: halting all I/Os to diskgroup 2 (DBFS_DG)
NOTE: unable to offline disks after getting write error for diskgroup DBFS_DG
NOTE: write to disk 32 succeeded
NOTE: disk 18 had IO error
NOTE: write to disk 6 succeeded
NOTE: halting all I/Os to diskgroup 1 (DATA_KLYX)
NOTE: LGWR doing non-clean dismount of group 1 (DATA_KLYX)
NOTE: LGWR sync ABA=21.5465 last written ABA 21.5465
GMON checking disk modes for group 1 at 57 for pid 44, osid 4663
ERROR: too many offline disks in PST (grp 1)
NOTE: checking PST for grp 1 done.

Fri Oct 09 15:44:26 2015

NOTE: process _user4655_+asm1 (4655) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1

NOTE: checking PST: grp = 1

ERROR: Disk 18 cannot be offlined, since all the disks [18, 49] with mirrored data would be offline.

ERROR: too many offline disks in PST (grp 2)

GMON checking disk modes for group 1 at 55 for pid 41, osid 4655

Fri Oct 09 15:44:26 2015

NOTE: cache dismounting (not clean) group 2/0x4AD8D253 (DBFS_DG)

ERROR: too many offline disks in PST (grp 1)

NOTE: checking PST for grp 1 done.

WARNING: Disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 in mode 0x7f is now being taken offline on ASM inst 1

NOTE: initiating PST update: grp = 1, dsk = 46/0xe968239f, mode = 0x6a, op = 4

GMON updating disk modes for group 1 at 56 for pid 41, osid 4655

ERROR: Disk 46 cannot be offlined, since all the disks [46, 72] with mirrored data would be offline.

ERROR: too many offline disks in PST (grp 1)

WARNING: Offline of disk 46 (DATA_KLYX_CD_10_KLYXCEL02) in group 1 and mode 0x7f failed on ASM inst 1

WARNING: Disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 mode 0x7f offline is being aborted

Fri Oct 09 15:44:26 2015

NOTE: process _user4663_+asm1 (4663) initiating offline of disk 46.3915916191 (DATA_KLYX_CD_10_KLYXCEL02) with mask 0x7e in group 1

NOTE: checking PST: grp = 1

WARNING: Offline of disk 18 (DBFS_DG_CD_10_KLYXCEL02) in group 2 and mode 0x7f failed on ASM inst 1

Fri Oct 09 15:44:26 2015

NOTE: halting all I/Os to diskgroup 2 (DBFS_DG)

NOTE: unable to offline disks after getting write error for diskgroup DBFS_DG

NOTE: write to disk 32 succeeded

NOTE: disk 18 had IO error

NOTE: write to disk 6 succeeded

NOTE: halting all I/Os to diskgroup 1 (DATA_KLYX)

NOTE: LGWR doing non-clean dismount of group 1 (DATA_KLYX)

NOTE: LGWR sync ABA=21.5465 last written ABA 21.5465

GMON checking disk modes for group 1 at 57 for pid 44, osid 4663

ERROR: too many offline disks in PST (grp 1)

NOTE: checking PST for grp 1 done.

我们可以看到，cell 2节点的第10号disk存在异常，通过cellcli命令行也可以进一步确认：

CellCLI&gt; list physicaldisk;
20:0            E07Q5N          normal
20:1            E05SMD          normal
20:2            E07GV6          normal
20:3            E060YD          normal
20:4            E05ZQE          normal
20:5            E07L9M          normal
20:6            E07FED          normal
20:7            E05N88          normal
20:8            E08DY5          normal
20:9            E06ADQ          normal
20:10           E08E5C          critical
20:11           E08DYF          normal

CellCLI> list physicaldisk;

20:0 E07Q5N normal

20:1 E05SMD normal

20:2 E07GV6 normal

20:3 E060YD normal

20:4 E05ZQE normal

20:5 E07L9M normal

20:6 E07FED normal

20:7 E05N88 normal

20:8 E08DY5 normal

20:9 E06ADQ normal

20:10 E08E5C critical

20:11 E08DYF normal

通过对该磁盘进行修复检查，确实也发现了坏道，如下所示：

SEAGATE ST360057SSUN600G --&gt; SEAGATE ST360057SSUN600G
Sector 0 --&gt; Sector 0
1,172,123,568 Sectors
Sectors that could not be read:
102,578,137
102,578,139
102,578,140
102,578,141
102,578,142
102,578,149
102,578,152
......
1,172,123,561 sector(s) successfully copied.
7 bad source sectors encountered.
Corresponding destination sectors filled with: UNREADABLESECTOR

SEAGATE ST360057SSUN600G --> SEAGATE ST360057SSUN600G

Sector 0 --> Sector 0

1,172,123,568 Sectors

Sectors that could not be read:

102,578,137

102,578,139

102,578,140

102,578,141

102,578,142

102,578,149

102,578,152

......

1,172,123,561 sector(s) successfully copied.

7 bad source sectors encountered.

Corresponding destination sectors filled with: UNREADABLESECTOR

修复完毕后，成功mount diskgroup。但是最后检查数据库存在部分数据坏块，如下所示：

SQL&gt;  select * from  V$DATABASE_BLOCK_CORRUPTION;
FILE#     BLOCK#     BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE
---------- ---------- ---------- ------------------ -----------------
1    5186816          3                  0 CORRUPT
1    5186813          1                  0 CORRUPT
1    3343734          2                  0 CORRUPT
25    2391372          1         1.4561E+13 CORRUPT
35     538916          1                  0 CORRUPT
36     543804          3                  0 CORRUPT

SQL> select * from V$DATABASE_BLOCK_CORRUPTION;

FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTION_TYPE

---------- ---------- ---------- ------------------ -----------------

1 5186816 3 0 CORRUPT

1 5186813 1 0 CORRUPT

1 3343734 2 0 CORRUPT

25 2391372 1 1.4561E+13 CORRUPT

35 538916 1 0 CORRUPT

36 543804 3 0 CORRUPT

对于数据坏块的修复相对比较简单，只是由于是asm环境处理起来相对麻烦一些，加上客户没有备份，所以无法进行blockcover。对于system的几个数据坏块，经过检查发现为审计对象，因此通过truncate然后insert数据即可格式化掉；

另外第25号文件的数据坏块为逻辑坏块，通过查询确认为index，直接rebuild index即可。

至于说最后第第35，36号文件，由于均为数据文件，涉及的对象均为分区表的partiiton，因此处理相对麻烦一些。

这里我通过创建测试表，然后清理掉数据，构造一个数据空块，然后将该数据块复制到asm diskgroup中替换到这几个坏块即可，注意替换之前，需要修改数据坏块的rdba地址和obj id信息，如下是数据块的复制替换过程：

SQL&gt; @copy_block_to_asm.sql
Enter value for file_with_patched_block: /u01/.../blk.dbf
old  15: v_FsFileName := '&amp;file_with_patched_block';
new  15: v_FsFileName := '/u01/.../blk.dbf';
Enter value for file_to_patch_in_asm: +DATA_KLYX/xxxx1/datafile/epmrange107.dbf
old  16: v_AsmFileName := '&amp;file_to_patch_in_ASM';
new  16: v_AsmFileName := '+DATA_KLYX/xxxx1/datafile/epmrange107.dbf';
Enter value for block_to_patch: 543804
old  17: v_offstart := '&amp;block_to_patch';
new  17: v_offstart := '543804';
PL/SQL procedure successfully completed.

SQL> @copy_block_to_asm.sql

Enter value for file_with_patched_block: /u01/.../blk.dbf

old 15: v_FsFileName := '&file_with_patched_block';

new 15: v_FsFileName := '/u01/.../blk.dbf';

Enter value for file_to_patch_in_asm: +DATA_KLYX/xxxx1/datafile/epmrange107.dbf

old 16: v_AsmFileName := '&file_to_patch_in_ASM';

new 16: v_AsmFileName := '+DATA_KLYX/xxxx1/datafile/epmrange107.dbf';

Enter value for block_to_patch: 543804

old 17: v_offstart := '&block_to_patch';

new 17: v_offstart := '543804';

PL/SQL procedure successfully completed.

最后通过rman检查，确认一切正常。如下是rman的检测结果：

.......
RMAN&gt; backup validate datafile 36;
Starting backup at 16-OCT-15
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1337 instance=xxx1 device type=DISK
channel ORA_DISK_1: starting full datafile backup set
channel ORA_DISK_1: specifying datafile(s) in backup set
input datafile file number=00036 name=+DATA_KLYX/xxx1/datafile/epmrange107.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
List of Datafiles
=================
File Status Marked Corrupt Empty Blocks Blocks Examined High SCN
---- ------ -------------- ------------ --------------- ----------
36   OK     0              315258       3276800         14561048845049
File Name: +DATA_KLYX/xxx1/datafile/epmrange107.dbf
Block Type Blocks Failing Blocks Processed
---------- -------------- ----------------
Data       0              2926354
Index      0              20
Other      0              35168
......

.......

RMAN> backup validate datafile 36;

Starting backup at 16-OCT-15

using target database control file instead of recovery catalog

allocated channel: ORA_DISK_1

channel ORA_DISK_1: SID=1337 instance=xxx1 device type=DISK

channel ORA_DISK_1: starting full datafile backup set

channel ORA_DISK_1: specifying datafile(s) in backup set

input datafile file number=00036 name=+DATA_KLYX/xxx1/datafile/epmrange107.dbf

channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15

List of Datafiles

=================

File Status Marked Corrupt Empty Blocks Blocks Examined High SCN

---- ------ -------------- ------------ --------------- ----------

36 OK 0 315258 3276800 14561048845049

File Name: +DATA_KLYX/xxx1/datafile/epmrange107.dbf

Block Type Blocks Failing Blocks Processed

---------- -------------- ----------------

Data 0 2926354

Index 0 20

Other 0 35168

......

Tags: CORRUPT BLOCK, exadata recover, ORA-15080, ORA-15130, ORA-15131
Posted in backup & rcovery on 10月 21, 2015

You must be logged in to post a comment.

love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客

Categories

Archives

最新评论

国内圈子

oracle security

记一次TB级别的Exadata数据库恢复

Leave a Reply

love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客

Categories

Archives

最新评论

国内圈子

oracle security

记一次TB级别的Exadata数据库恢复

Leave a Reply

记一次TB级别的Exadata数据库恢复