Recover Case: a 24TB RAC (ASM) Recovery
Unless marked as a repost, articles on this site are original. Reposted from love wife love life — Roger's Oracle/MySQL/PostgreSQL data recovery blog.
A few days ago, a customer's core database — a RAC of roughly 24TB — hit a problem: an ASM diskgroup would not mount. Analysis showed that one block on one of the disks was corrupt. Reading it with kfed confirmed the damage:
$ kfed read /dev/rdisk/disk392 aun=0 blkn=2 | more
kfbh.endian:                         76 ; 0x000: 0x4c
kfbh.hard:                           86 ; 0x001: 0x56
kfbh.type:                           77 ; 0x002: *** Unknown Enum ***
kfbh.datfmt:                         82 ; 0x003: 0x52
kfbh.block.blk:              1162031153 ; 0x004: blk=1162031153
kfbh.block.obj:               620095014 ; 0x008: file=386598
kfbh.check:                  1426510413 ; 0x00c: 0x5506d24d
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                  524288639 ; 0x018: 0x1f40027f
kfbh.spare2:                          0 ; 0x01c: 0x00000000
60000000000F3200 4C564D52 45433031 24F5E626 5506D24D  [LVMREC01$..&U..M]
60000000000F3210 00000000 00000000 1F40027F 00000000  [.........@......]
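Those "unknown" values are not random garbage: the first eight bytes of the block spell out what appears to be an LVM record signature ("LVMREC01") overwriting the ASM header. A minimal Python sketch below checks the dumped bytes against the first fields kfed validates — a simplified rule of thumb (kfbh.endian must be 0 or 1, kfbh.hard is 0x82 on healthy blocks), not the full check ASM performs:

```python
# Bytes 0x00-0x0f of the corrupt block, copied from the kfed dump above.
raw = bytes.fromhex("4C564D5245433031" "24F5E626" "5506D24D")

def kfbh_header_ok(b):
    """Rough sanity check mirroring what the kfed dump complains about:
    kfbh.endian (byte 0) must be 0 (big) or 1 (little), and kfbh.hard
    (byte 1) is 0x82 on a valid ASM block. Simplified assumption, not
    the full ASM validation."""
    return b[0] in (0, 1) and b[1] == 0x82

print(kfbh_header_ok(raw))         # the dumped block fails the check
print(raw[:8].decode("ascii"))     # "LVMREC01" -- not an ASM header
```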
We hand-built a replacement block and merged it in to repair the damage. When I then tried to mount the diskgroup, however, it still failed:
Fri Oct 28 04:47:56 2016
WARNING: cache read a corrupt block: group=3(DATA) dsk=49 blk=18 disk=49 (DATA_0049) incarn=3636812057 au=0 blk=18 count=1
Errors in file /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_21799.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
NOTE: a corrupted block from group DATA was dumped to /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_21799.trc
WARNING: cache read (retry) a corrupt block: group=3(DATA) dsk=49 blk=18 disk=49 (DATA_0049) incarn=3636812057 au=0 blk=18 count=1
Errors in file /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_21799.trc:
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
ERROR: cache failed to read group=3(DATA) dsk=49 blk=18 from disk(s): 49(DATA_0049)
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483697] [18] [76 != 0]
NOTE: cache initiating offline of disk 49 group DATA
NOTE: process _arb0_+asm1 (21799) initiating offline of disk 49.3636812057 (DATA_0049) with mask 0x7e in group 3
WARNING: Disk 49 (DATA_0049) in group 3 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 3, dsk = 49/0xd8c55919, mask = 0x6a, op = clear
Fri Oct 28 04:47:56 2016
GMON updating disk modes for group 3 at 23 for pid 25, osid 21799
ERROR: Disk 49 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 3)
WARNING: Offline of disk 49 (DATA_0049) in group 3 and mode 0x7f failed on ASM inst 1
Fri Oct 28 04:47:56 2016
NOTE: halting all I/Os to diskgroup 3 (DATA)
Fri Oct 28 04:47:56 2016
NOTE: cache dismounting (not clean) group 3/0x51B5A89F (DATA)
NOTE: messaging CKPT to quiesce pins
Unix process pid: 23376, image: oracle@cqracdb1 (B000)
Fri Oct 28 04:47:56 2016
ERROR: ORA-15130 in COD recovery for diskgroup 3/0x51b5a89f (DATA)
ERROR: ORA-15130 thrown in RBAL for group number 3
Errors in file /oracle/ora11g/crs_base/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_6465.trc:
ORA-15130: diskgroup "DATA" is being dismounted
This error is familiar by now. The log shows that block 18 of AU 0 on disk 49 is also damaged; reading it with kfed confirmed a bad block, just like block 2 earlier. Applying the same recipe — constructing an identical block and merging it in — the diskgroup then mounted successfully.
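The "construct an identical block, then merge" step typically means dumping a healthy block of the same type to text (kfed read ... text=good.txt), patching the block-specific fields, and writing the result back with kfed merge. The helper below is a hypothetical sketch of the text-patching step only — the field-name format matches kfed's output, but which fields must be patched depends on the block type, and this is an illustration, not the exact procedure used in this case:

```python
def patch_kfed_dump(lines, overrides):
    """Replace the value column of each 'name: value ; 0xOFF: ...' line
    whose field name appears in overrides. kfed merge reads the value
    column, so only that part needs to change."""
    out = []
    for line in lines:
        name = line.split(":", 1)[0].strip()
        if name in overrides and ";" in line:
            _, tail = line.split(";", 1)
            out.append(f"{name}: {overrides[name]} ;{tail}")
        else:
            out.append(line)
    return out

# Two example lines in kfed's text-dump format (hypothetical values).
good = [
    "kfbh.endian:                          0 ; 0x000: 0x00",
    "kfbh.block.blk:                       2 ; 0x004: blk=2",
]
patched = patch_kfed_dump(good, {"kfbh.block.blk": 18})
print(patched[1])   # value column now reads 18
```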
SQL> alter system set asm_power_limit=0 scope=both;

System altered.

SQL> alter diskgroup data mount;

Diskgroup altered.
After the diskgroup mounted, I checked the database and found that CRS had already brought it back up and opened it automatically. Further checking, however, showed that ASM's ARB process was still throwing errors:
ARB0 relocating file +DATA.278.794162479 (120 entries)
DDE: Problem Key 'ORA 600 [kfdAuDealloc2]' was flood controlled (0x2) (incident: 1486148)
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [85], [278], [14309], [], [], [], [], [], [], [], []
OSM metadata struct dump of kfdatb:
kfdatb.aunum:                      7168 ; 0x000: 0x00001c00
kfdatb.shrink:                      448 ; 0x004: 0x01c0
kfdatb.ub2pad:                     7176 ; 0x006: 0x1c08
kfdatb.auinfo[0].link.next:           8 ; 0x008: 0x0008
kfdatb.auinfo[0].link.prev:           8 ; 0x00a: 0x0008
kfdatb.auinfo[1].link.next:          12 ; 0x00c: 0x000c
kfdatb.auinfo[1].link.prev:          12 ; 0x00e: 0x000c
kfdatb.auinfo[2].link.next:          16 ; 0x010: 0x0010
kfdatb.auinfo[2].link.prev:          16 ; 0x012: 0x0010
kfdatb.auinfo[3].link.next:          20 ; 0x014: 0x0014
kfdatb.auinfo[3].link.prev:          20 ; 0x016: 0x0014
kfdatb.auinfo[4].link.next:          24 ; 0x018: 0x0018
kfdatb.auinfo[4].link.prev:          24 ; 0x01a: 0x0018
kfdatb.auinfo[5].link.next:          28 ; 0x01c: 0x001c
kfdatb.auinfo[5].link.prev:          28 ; 0x01e: 0x001c
kfdatb.auinfo[6].link.next:          32 ; 0x020: 0x0020
kfdatb.auinfo[6].link.prev:          32 ; 0x022: 0x0020
kfdatb.spare:                         0 ; 0x024: 0x00000000
Dump of ate#:0
OSM metadata struct dump of kfdate:
kfdate.discriminator:                 1 ; 0x000: 0x00000001
kfdate.allo.lo:                       0 ; 0x000: XNUM=0x0
kfdate.allo.hi:                 8388608 ; 0x004: V=1 I=0 H=0 FNUM=0x0
This no longer affected normal database operation, but because ARB kept failing, the rebalance never actually completed: the disks the customer had just added were barely used, leaving space usage across the diskgroup's disks unbalanced.
The error looks complicated but is actually simple. From the trailing arguments we can tell that, fundamentally, the two blocks we hand-built earlier are incomplete: this is an allocation table, and the kfdate entries that follow the header must also be constructed before the ARB process can keep working.
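For reference, the fixed head of the kfdatb structure can be decoded directly from the offsets printed in the trace above (aunum u32 at 0x000, shrink u16 at 0x004, ub2pad u16 at 0x006, then seven auinfo free-list link pairs of u16 next/prev). The sketch below rebuilds those bytes from the dumped values and parses them back; the field widths are inferred from the dump's offsets, while the byte order and anything beyond the dumped fields are assumptions:

```python
import struct

def parse_kfdatb(buf):
    """Decode the fixed head of a kfdatb allocation-table block.
    Layout inferred from the trace offsets; little-endian assumed
    here for illustration."""
    aunum, shrink, pad = struct.unpack_from("<IHH", buf, 0)
    links = [struct.unpack_from("<HH", buf, 8 + 4 * i) for i in range(7)]
    return {"aunum": aunum, "shrink": shrink, "auinfo": links}

# Rebuild the bytes from the values shown in the dump, then parse back.
raw = struct.pack("<IHH", 7168, 448, 7176)
for off in (8, 12, 16, 20, 24, 28, 32):
    raw += struct.pack("<HH", off, off)   # next == prev == own offset

t = parse_kfdatb(raw)
print(t["aunum"])   # 7168 -- first AU covered by this table block
# next == prev == the entry's own offset suggests an empty free list
print(all(n == p for n, p in t["auinfo"]))
```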
Meanwhile, 熊爷 is already modifying the ODU code to take care of this leftover issue with ODU itself. It looks like ODU will soon be able to repair ASM metadata. Impressive!