All Because of a Disk Replacement
Not long ago, in a customer's distributed Oracle cluster environment (our company's zdata distributed architecture), a storage node had its faulty memory replaced. Shortly afterwards, the customer's engineers noticed a disk alarm on another node and swapped out that disk as well. That is where the tragedy began...
Tue Mar 03 11:33:25 2020
NOTE: disk 85 (DG_DATA01_0085) in group 1 (DG_DATA01) is locally offline for writes
Tue Mar 03 11:33:25 2020
NOTE: successfully wrote at least one mirror side for diskgroup DG_DATA01
Tue Mar 03 11:33:25 2020
......
Tue Mar 03 11:40:49 2020
NOTE: process _user43764_+asm1 (43764) initiating offline of disk 35.3916024611 (DG_DATA01_0035) with mask 0x7e in group 1 (DG_DATA01) with client assisting
NOTE: checking PST: grp = 1
Tue Mar 03 11:40:49 2020
GMON checking disk modes for group 1 at 195 for pid 33, osid 43764
Tue Mar 03 11:40:49 2020
ERROR: too many offline disks in PST (grp 1)
Tue Mar 03 11:40:49 2020
......
NOTE: cache closing disk 122 of grp 1: (not open) DG_DATA01_0122
ERROR: disk 35(DG_DATA01_0035) in group 1(DG_DATA01) cannot be offlined because all disks [35(DG_DATA01_0035), 83(DG_DATA01_0083)] with mirrored data would be offline.
Tue Mar 03 11:40:49 2020
ERROR: too many offline disks in PST (grp 1)
Tue Mar 03 11:40:49 2020
NOTE: cache dismounting (not clean) group 1/0xFF593AAC (DG_DATA01)
NOTE: messaging CKPT to quiesce pins Unix process pid: 94856, image: oracle@zhjqc01 (B000)
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 0 of grp 1: (not open) DG_DATA01_0000
Tue Mar 03 11:40:49 2020
NOTE: halting all I/Os to diskgroup 1 (DG_DATA01)
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 80 of grp 1: (not open) DG_DATA01_0080
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 82 of grp 1: (not open) DG_DATA01_0082
Tue Mar 03 11:40:49 2020
WARNING: Offline of disk 35 (DG_DATA01_0035) in group 1 and mode 0x7f failed on ASM inst 1
Tue Mar 03 11:40:49 2020
NOTE: cache closing disk 83 of grp 1: (not open) DG_DATA01_0083
Tue Mar 03 11:40:50 2020
......
NOTE: LGWR doing non-clean dismount of group 1 (DG_DATA01) thread 2
NOTE: LGWR sync ABA=583.8239 last written ABA 583.8239
Tue Mar 03 11:40:50 2020
NOTE: initiating dirty detach from lock domain 1
Tue Mar 03 11:40:50 2020
.......
Tue Mar 03 11:40:50 2020
Dirty Detach Reconfiguration complete (total time 0.3 secs)
ASM Health Checker found 1 new failures
Tue Mar 03 11:40:51 2020
ERROR: ORA-15130 in COD recovery for diskgroup 1/0xff593aac (DG_DATA01)
ERROR: ORA-15130 thrown in RBAL for group number 1
Tue Mar 03 11:40:51 2020
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_55928.trc:
ORA-15130: diskgroup "DG_DATA01" is being dismounted
Tue Mar 03 11:40:54 2020
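When triaging a case like this, the telltale pattern in the ASM alert log is the failed disk offline ("too many offline disks in PST") followed by the forced dismount. A quick way to pull the relevant entries, reusing the trace directory shown above (the alert-log file name follows the usual convention and may differ on your system):

# Pull the disk-offline and PST-related entries from the ASM alert log
# (directory taken from the trace reference above; adjust for your instance)
grep -nE "offlin|too many offline disks|ORA-15130" \
    /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | tail -n 80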
At this point, if you try to mount the diskgroup manually, it complains about a missing disk.
ERROR: diskgroup DG_DATA01 was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "35" in group "DG_DATA01" may result in a data loss
ORA-15042: ASM disk "35" is missing from group number "1"
ORA-15080: synchronous I/O operation failed to read block 0 of disk 13 in disk group
ORA-27061: waiting for async I/Os failed
Linux-x86_64 Error: 5: Input/output error
Additional information: 4294967295
Additional information: 4096
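Before touching anything, it helps to confirm exactly which disk the group is missing. A minimal sketch, assuming you can connect to the ASM instance as sysasm; the device path /dev/asm-diskX is a placeholder:

SQL> -- the replaced disk normally shows up as MISSING/OFFLINE here
SQL> select group_number, disk_number, name, path, header_status, mode_status, state
       from v$asm_disk
      order by group_number, disk_number;

# optionally inspect the on-disk header of a suspect device directly
kfed read /dev/asm-diskX | egrep "kfbh.type|kfdhdb.dskname|kfdhdb.grpname"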
The missing disk here is in fact the one that had been replaced earlier. Digging further into the database logs, we found that disks on some nodes had previously been offlined and dropped automatically, so the diskgroup was still in the middle of a rebalance; at that point the storage node was restarted (for the memory replacement), and before that node had finished coming back up, the disk on the other node was swapped, which crashed the database outright.
This may sound a bit convoluted, so here is a simple explanation. The data uses normal redundancy, i.e. one mirror copy. Suppose an extent A and its mirror A1 sit on two different storage nodes. If one node reboots and its disks have not yet been recognized by the cluster, and at that very moment a disk is swapped on the other node, mirror A1 may happen to live on exactly that disk. With the data now incomplete, the database is bound to crash immediately.
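The disk partnership that makes this scenario fatal can be inspected from the ASM instance. A rough sketch against the undocumented fixed table x$kfdpartner; treat the column names as assumptions, since they can vary between versions:

SQL> -- for each disk in group 1, list its partner disks; losing a disk
SQL> -- together with all of its partners is what dismounts the group
SQL> select disk, number_kfdpartner as partner
       from x$kfdpartner
      where grp = 1
      order by disk, partner;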
The final fix was actually fairly simple: after the original disk was plugged back into the machine, the diskgroup could be mounted with mount force and came up normally, and a short while later the database started rebalancing again.
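For reference, the recovery boiled down to two steps, shown here with the diskgroup name from this case; the subsequent rebalance can be watched in v$asm_operation:

SQL> alter diskgroup DG_DATA01 mount force;

SQL> -- monitor the rebalance that kicks off once the group is back
SQL> select group_number, operation, state, power, sofar, est_work, est_minutes
       from v$asm_operation;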
Because the customer's database is fairly large, around 60 TB, the whole recovery took quite a while, with a lot of coordination work involved; on top of that, the customer's data center is very far away, a two-hour round trip.
After the fact we replayed and verified this case, and a faster method could in fact have been used. It does carry some risk, since it involves editing ASM metadata; essentially only the PST (Partnership and Status Table) needs to be fixed.
Below is a simple walkthrough of the test.
SQL> alter diskgroup test mount force;
alter diskgroup test mount force
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "1" in group "TEST" may result in a data loss
ORA-15042: ASM disk "1" is missing from group number "2"

=============== PST ====================
grpNum:    2
state:     2
callCnt:   91
(lockvalue) valid=1 ver=0.0 ndisks=2 flags=0x3 from inst=0 (I am 1) last=0
--------------- HDR --------------------
next:            46
last:            46
pst count:       2
pst locations:   2  0
incarn:          45
dta size:        6
version:         1
ASM version:     186646528 = 11.2.0.0.0
contenttype:     1
partnering pattern: [  ]
--------------- LOC MAP ----------------
0: dirty 0  cur_loc: 0  stable_loc: 0
1: dirty 1  cur_loc: 1  stable_loc: 1
--------------- DTA --------------------
0: sts v v(rw) p(rw) a(x) d(x) fg# = 1 addTs = 2429200834 parts: 5 (amp) 4 (amp) 3 (amp) 2 (amp)
1: sts v v(-w) p(-w) a(-) d(-) fg# = 1 addTs = 2429200834 parts: 4 (amp) 5 (amp) 2 (amp) 3 (amp)
2: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 1 (amp) 4 (amp) 5 (amp) 0 (amp)
3: sts v v(rw) p(rw) a(x) d(x) fg# = 2 addTs = 2429386451 parts: 5 (amp) 0 (amp) 1 (amp) 4 (amp)
4: sts v v(--) p(--) a(-) d(-) fg# = 3 addTs = 2429203972 parts: 1 (amp) 0 (amp) 2 (amp) 3 (amp)
5: sts v v(--) p(--) a(-) d(-) fg# = 3 addTs = 2429203972 parts: 0 (amp) 1 (amp) 3 (amp) 2 (amp)

[grid@lx1 ~]$ kfed read /dev/asm-diske aun=1 blkn=2|grep "status"|grep -v "I=0"
kfdpDtaEv1[0].status:  127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status:  127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status:  127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:  127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:   21 ; 0x0c0: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[5].status:   21 ; 0x0f0: I=1 V=0 V=1 P=0 P=1 A=0 D=0
[grid@lx1 ~]$ kfed read /dev/asm-diskc aun=1 blkn=2|grep "status"|grep -v "I=0"
kfdpDtaEv1[0].status:  127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status:  127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status:  127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:  127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:   21 ; 0x0c0: I=1 V=0 V=1 P=0 P=1 A=0 D=0
kfdpDtaEv1[5].status:   21 ; 0x0f0: I=1 V=0 V=1 P=0 P=1 A=0 D=0
[grid@test ~]$ kfed read /dev/asm-diskh aun=1 blkn=2|grep "status"|grep -v "I=0"
kfdpDtaEv1[0].status:  127 ; 0x000: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[1].status:  127 ; 0x030: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[2].status:  127 ; 0x060: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[3].status:  127 ; 0x090: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[4].status:  127 ; 0x0c0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
kfdpDtaEv1[5].status:  127 ; 0x0f0: I=1 V=1 V=1 P=1 P=1 A=1 D=1
[grid@test ~]$ kfed read /dev/asm-diskh aun=1 blkn=2|grep "status"|grep -v "I=0" > repair.txt
[grid@test ~]$ kfed merge /dev/asm-diske aun=1 blkn=2 text=repair.txt
[grid@test ~]$ kfed merge /dev/asm-diskc aun=1 blkn=2 text=repair.txt
[grid@test ~]$ kfed merge /dev/asm-diske aun=1 blkn=3 text=repair.txt
[grid@test ~]$ kfed merge /dev/asm-diskc aun=1 blkn=3 text=repair.txt

SQL> alter diskgroup test mount force;

Diskgroup altered.
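Before running the final mount force in a real case, it is worth re-reading the patched PST blocks and checking that they now match the intact copy. A small sketch reusing the devices and the repair.txt file from the test above:

# re-extract the status lines from the patched copies and diff against the good one
kfed read /dev/asm-diske aun=1 blkn=2 | grep "status" | grep -v "I=0" > check_e.txt
kfed read /dev/asm-diskc aun=1 blkn=2 | grep "status" | grep -v "I=0" > check_c.txt
diff check_e.txt repair.txt && diff check_c.txt repair.txt \
    && echo "PST status bytes now match the good copy"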
Needless to say, be very cautious performing this kind of operation on a production database; contact us if you need help.