An Unusual Oracle RAC Case
Not long ago, a customer's Oracle RAC cluster ran into trouble: after one node failed, multiple instances kept restarting over and over (seven Oracle instances run on this RAC environment).
Eventually a pattern emerged: whenever the node 1 ASM instance was started, all of the instances on node 2 would be evicted and crash.
Let's start with the most recent event, in which one of the core database instances crashed:
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_asmb_12418.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 2047 Serial number: 21
Errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_asmb_12418.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 2047 Serial number: 21
Tue Mar 16 11:44:44 2021
System state dump requested by (instance=2, osid=12418 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/xxxx/xxxx2/trace/xxxx2_diag_12183_20210316114444.trc
ASMB (ospid: 12418): terminating the instance due to error 15064
Tue Mar 16 11:44:48 2021
ORA-1092 : opitsk aborting process
Instance terminated by ASMB, pid = 12418
Tue Mar 16 11:44:50 2021
ORA-1092 : opitsk aborting process
Tue Mar 16 11:48:19 2021
Adjusting the default value of parameter parallel_max_servers
from 3200 to 1470 due to the value of parameter processes (1500)
Starting ORACLE instance (normal)
From the above, the database instance lost communication with the ASM instance and was terminated by its own ASMB process. Moving up one level, the ASM alert log shows that at 11:44:44 the ASM instance on this node was itself forcibly terminated, which is what took down every database instance on the node.
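As a quick cross-check (a hypothetical sketch; the paths are assumed from the standard ADR layout seen above), one can scan the alert logs of every instance on the node for the same termination:

# list each instance alert log on this node that recorded the ASM
# communication failure (adjust the diag base to your environment)
grep -l "ORA-15064" /u01/app/oracle/diag/rdbms/*/*/trace/alert_*.log

# confirm the termination timestamps line up across instances
grep -H "terminating the instance due to error 15064" \
     /u01/app/oracle/diag/rdbms/*/*/trace/alert_*.log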
Next, the ASM instance alert log:
Mon Mar 15 17:02:48 2021
NOTE: client epaydb2:epaydb registered, osid 17050, mbr 0x6
Mon Mar 15 22:22:15 2021
NOTE: client rbcg2:rbcg registered, osid 46610, mbr 0x7
Tue Mar 16 11:44:44 2021
Communications reconfiguration: instance_number 1
Instance termination initiated by instance 1 with reason 1.
Instance 1 received a reconfig event from its cluster manager indicating that this instance is supposed to be down
Please check instance 1's alert log and LMON trace file for more details.
Please also examine the CSS log files.
Tue Mar 16 11:44:44 2021
System state dump requested by (instance=2, osid=48578 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_diag_48571_20210316114444.trc
LMON (ospid: 48578): terminating the instance due to error 481
Instance terminated by LMON, pid = 48578
Tue Mar 16 11:44:51 2021
So the node 2 ASM instance was forcibly terminated because it received a termination request from instance 1. The next stop is node 1's ASM alert log:
Mon Mar 15 16:58:26 2021
Instance shutdown complete
Tue Mar 16 11:42:12 2021
NOTE: No asm libraries found in the system
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 1, checking for the existence of node 0...
* node 0 does not exist. instance_number = 1
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Initial number of CPU is 80
Number of processor cores in the system is 40
............
Tue Mar 16 11:42:14 2021
SMON started with pid=17, OS id=36978
Tue Mar 16 11:42:14 2021
RBAL started with pid=18, OS id=36980
Tue Mar 16 11:42:14 2021
GMON started with pid=19, OS id=36982
Tue Mar 16 11:42:14 2021
MMON started with pid=20, OS id=36984
Tue Mar 16 11:42:14 2021
MMNL started with pid=21, OS id=36986
lmon registered with NM - instance number 1 (internal mem no 0)
Tue Mar 16 11:44:44 2021
Communications reconfiguration: instance_number 2
Tue Mar 16 11:44:44 2021
Dumping diagnostic data in directory=[cdmp_20210316114444], requested by (instance=2, osid=48578 (LMON)), summary=[abnormal instance termination].
Reconfiguration started (old inc 0, new inc 194)
ASM instance
List of instances:
 1 (myinst: 1)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Mar 16 11:44:49 2021
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Tue Mar 16 11:44:50 2021
LCK0 started with pid=22, OS id=37364
Tue Mar 16 11:44:52 2021
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Tue Mar 16 11:44:52 2021
SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
NOTE: Diskgroups listed in ASM_DISKGROUPS are DATA ARCDG
NOTE: Diskgroup used for Voting files is:
   OCRVDISK
Diskgroup with spfile:OCRVDISK
Diskgroup used for OCR is:OCRVDISK
NOTE: cache registered group ARCDG number=1 incarn=0x91c30b10
NOTE: cache began mount (first) of group ARCDG number=1 incarn=0x91c30b10
NOTE: cache registered group DATA number=2 incarn=0x91d30b11
NOTE: cache began mount (first) of group DATA number=2 incarn=0x91d30b11
NOTE: cache registered group OCRVDISK number=3 incarn=0x91d30b12
NOTE: cache began mount (first) of group OCRVDISK number=3 incarn=0x91d30b12
NOTE: Assigning number (1,9) to disk (/dev/raw/raw35)
NOTE: Assigning number (1,5) to disk (/dev/raw/raw31)
NOTE: Assigning number (1,8) to disk (/dev/raw/raw34)
NOTE: Assigning number (1,6) to disk (/dev/raw/raw32)
NOTE: Assigning number (1,7) to disk (/dev/raw/raw33)
NOTE: Assigning number (2,11) to disk (/dev/raw/raw30)
NOTE: Assigning number (2,8) to disk (/dev/raw/raw27)
NOTE: Assigning number (2,10) to disk (/dev/raw/raw29)
NOTE: Assigning number (2,6) to disk (/dev/raw/raw25)
NOTE: Assigning number (2,9) to disk (/dev/raw/raw28)
NOTE: Assigning number (2,5) to disk (/dev/raw/raw24)
NOTE: Assigning number (2,7) to disk (/dev/raw/raw26)
NOTE: Assigning number (3,4) to disk (/dev/raw/raw22)
NOTE: Assigning number (3,3) to disk (/dev/raw/raw21)
NOTE: Assigning number (3,5) to disk (/dev/raw/raw23)
Tue Mar 16 11:44:58 2021
.....
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=2 ckpt=1413.4330 group=1 (ARCDG)
NOTE: advancing ckpt for group 1 (ARCDG) thread=2 ckpt=1413.4330
.....
* allocate domain 2, invalid = TRUE
NOTE: attached to recovery domain 2
NOTE: starting recovery of thread=2 ckpt=106.8449 group=2 (DATA)
NOTE: advancing ckpt for group 2 (DATA) thread=2 ckpt=106.8449
.....
GMON querying group 3 at 9 for pid 23, osid 37383
NOTE: cache opening disk 3 of grp 3: OCRVDISK_01 path:/dev/raw/raw21
NOTE: F1X0 found on disk 3 au 2 fcn 0.731
NOTE: cache opening disk 4 of grp 3: OCRVDISK_02 path:/dev/raw/raw22
NOTE: F1X0 found on disk 4 au 2 fcn 0.731
NOTE: cache opening disk 5 of grp 3: OCRVDISK_03 path:/dev/raw/raw23
WARNING: stale disk header detected disk=5 group=3 (OCRVDISK) f1b1locn=00000002 fcn=0.717
NOTE: stale disk header will be dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_37383.trc
NOTE: F1X0 found on disk 5 au 2 fcn 0.717
NOTE: cache mounting (first) normal redundancy group 3/0x91D30B12 (OCRVDISK)
* allocate domain 3, invalid = TRUE
NOTE: attached to recovery domain 3
NOTE: starting recovery of thread=1 ckpt=23.116 group=3 (OCRVDISK)
NOTE: advancing ckpt for group 3 (OCRVDISK) thread=1 ckpt=23.116
NOTE: cache recovered group 3 to fcn 0.1030
NOTE: redo buffer size is 256 blocks (1053184 bytes)
NOTE: LGWR attempting to mount thread 1 for diskgroup 3 (OCRVDISK)
NOTE: LGWR found thread 1 closed at ABA 23.115
NOTE: LGWR mounted thread 1 for diskgroup 3 (OCRVDISK)
NOTE: LGWR opening thread 1 at fcn 0.1030 ABA 24.116
NOTE: cache mounting group 3/0x91D30B12 (OCRVDISK) succeeded
Node 1's log confirms the sequence: node 2's ASM instance was killed at 11:44:44, while node 1's ASM instance only began mounting its disk groups at 11:44:52.
My initial suspicion was therefore that the starting node 1 ASM instance hit a problem while performing instance recovery during open, and that this is what drove it to forcibly terminate the node 2 ASM instance.
To test that idea, I checked the SMON trace files of the ASM instances on both nodes and found the following:
+++ node 1 ASM smon trace
[grid@xxxxdb01 +ASM1]$ more ./trace/*smon*.trc
*** ACTION NAME:() 2021-03-16 11:44:49.985
detected lock domain 1 invalid at system inc 194
detected lock domain 2 invalid at system inc 194
detected lock domain 3 invalid at system inc 194
detected lock domain 4 invalid at system inc 194
detected lock domain 5 invalid at system inc 194
.....

+++ node 2 ASM smon trace
[grid@xxxxdb02 +ASM2]$ ls -ltr trace/*smon*
-rw-r----- 1 grid oinstall   385 Mar 15 16:58 trace/+ASM2_smon_48596.trm
-rw-r----- 1 grid oinstall 14659 Mar 15 16:58 trace/+ASM2_smon_48596.trc
-rw-r----- 1 grid oinstall    61 Mar 16 11:47 trace/+ASM2_smon_74256.trm
-rw-r----- 1 grid oinstall  3814 Mar 16 11:47 trace/+ASM2_smon_74256.trc
-rw-r----- 1 grid oinstall    60 Mar 16 11:59 trace/+ASM2_smon_1752.trm
-rw-r----- 1 grid oinstall  3812 Mar 16 11:59 trace/+ASM2_smon_1752.trc
[grid@xxxxdb02 +ASM2]$ more trace/+ASM2_smon_74256.trc
......
detected lock domain 1 invalid at system inc 2
detected lock domain 2 invalid at system inc 2
detected lock domain 3 invalid at system inc 2
detected lock domain 4 invalid at system inc 2
......
Going further back, the SMON trace from the time of the original failure on March 15 shows that the OCR disk group was already in trouble:
*** 2021-03-15 16:46:47.100
2021-03-15 16:46:47.100334 : Start recovery for domain=1, valid=0, flags=0x0
NOTE: starting recovery of thread=1 ckpt=1476.879 group=1 (ARCDG)
2021-03-15 16:46:47.107508 : Validate domain 1
2021-03-15 16:46:47.113343 : Validated domain 1, flags = 0x20
detected lock domain 2 invalid at system inc 184
*** 2021-03-15 16:46:47.121
2021-03-15 16:46:47.121935 : Start recovery for domain=2, valid=0, flags=0x0
NOTE: starting recovery of thread=1 ckpt=88.2503 group=2 (DATA)
2021-03-15 16:46:47.129981 : Validate domain 2
2021-03-15 16:46:47.135200 : Validated domain 2, flags = 0x20
detected lock domain 3 invalid at system inc 184
*** 2021-03-15 16:46:47.140
2021-03-15 16:46:47.140375 : Start recovery for domain=3, valid=0, flags=0x0
WARNING: stale disk header detected disk=5 group=3 (OCRVDISK) f1b1locn=00000002 fcn=0.717
OSM metadata block dump:
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            1 ; 0x002: KFBTYP_DISKHEAD
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:              2147483653 ; 0x008: disk=5
kfbh.check:                    79145530 ; 0x00c: 0x04b7aa3a
kfbh.fcn.base:                      717 ; 0x010: 0x000002cd
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
..........
From the information above, an ASM instance must perform recovery at startup, and in Oracle terms every disk group constitutes a lock domain.
Node 1's SMON trace is full of "detected lock domain ... invalid" entries, from which we can conclude: while the node 1 ASM instance was starting, its SMON process could not obtain the domain locks needed for recovery and kept retrying. Only once the node 2 ASM instance had been forcibly terminated could node 1's SMON acquire those locks and finish mounting all of node 1's disk groups. By that point, of course, the node 2 ASM instance was already gone.
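As an aside, the domain number in these traces is just the disk group number (the logs above pair domain=1 with group=1 (ARCDG), and so on), so a quick query against the ASM instance maps domains back to names; a minimal sketch using the standard view:

-- domain N in the SMON trace corresponds to GROUP_NUMBER N
select group_number, name, state, type
  from v$asm_diskgroup
 order by group_number;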
Node 1's ASM alert log actually carries the same message:
WARNING: stale disk header detected disk=5 group=3 (OCRVDISK) f1b1locn=00000002 fcn=0.717
NOTE: stale disk header will be dumped to /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_37383.trc
That warning means the disk header information is inconsistent, so I went on to dump the header of every disk in the affected disk group with the kfed tool:
kfed read /dev/raw/raw21 > /tmp/1.txt
kfed read /dev/raw/raw22 > /tmp/2.txt
kfed read /dev/raw/raw23 > /tmp/3.txt

[grid@xxxxdb02 +ASM2]$ diff /tmp/2.txt /tmp/1.txt
6c6
< kfbh.block.obj:              2147483652 ; 0x008: disk=4
---
> kfbh.block.obj:              2147483651 ; 0x008: disk=3
20c20
< kfdhdb.dsknum:                        4 ; 0x024: 0x0004
---
> kfdhdb.dsknum:                        3 ; 0x024: 0x0003
23c23
< kfdhdb.dskname:             OCRVDISK_02 ; 0x028: length=11
---
> kfdhdb.dskname:             OCRVDISK_01 ; 0x028: length=11
25c25
< kfdhdb.fgname:              OCRVDISK_02 ; 0x068: length=11
---
> kfdhdb.fgname:              OCRVDISK_01 ; 0x068: length=11
[grid@xxxxdb02 +ASM2]$
[grid@xxxxdb02 +ASM2]$
[grid@xxxxdb02 +ASM2]$ diff /tmp/2.txt /tmp/3.txt
6,8c6,8
< kfbh.block.obj:              2147483652 ; 0x008: disk=4
< kfbh.check:                  2261076754 ; 0x00c: 0x86c54b12
< kfbh.fcn.base:                      731 ; 0x010: 0x000002db
---
> kfbh.block.obj:              2147483653 ; 0x008: disk=5
> kfbh.check:                  2261076832 ; 0x00c: 0x86c54b60
> kfbh.fcn.base:                      717 ; 0x010: 0x000002cd
20c20
< kfdhdb.dsknum:                        4 ; 0x024: 0x0004
---
> kfdhdb.dsknum:                        5 ; 0x024: 0x0005
23c23
< kfdhdb.dskname:             OCRVDISK_02 ; 0x028: length=11
---
> kfdhdb.dskname:             OCRVDISK_03 ; 0x028: length=11
25c25
< kfdhdb.fgname:              OCRVDISK_02 ; 0x068: length=11
---
> kfdhdb.fgname:              OCRVDISK_03 ; 0x068: length=11
49,50c49,50
< kfdhdb.spfile:                      101 ; 0x0f4: 0x00000065
< kfdhdb.spfflg:                        1 ; 0x0f8: 0x00000001
---
> kfdhdb.spfile:                        0 ; 0x0f4: 0x00000000
> kfdhdb.spfflg:                        0 ; 0x0f8: 0x00000000
[grid@xxxxdb02 +ASM2]$
The diffs show that, within the same disk group, the headers of disks 3 and 4 agree completely, while the header of disk 5, the disk named in the warning, does not. On analysis, though, the differing kfdhdb.spfile/kfdhdb.spfflg fields should be perfectly normal. Yet according to the following two MOS documents, the warning does indicate a logical inconsistency on the disk; it would not be raised otherwise:
Bug 13480225 - ASM reports: WARNING: stale disk header detected (Doc ID 13480225.8)
"WARNING: stale disk header detected" in ASM alert log (Doc ID 1947021.1)
So where did the problem really lie? Our subsequent analysis points to the root cause being that, during recovery, ASM SMON found that the COD (Continuing Operations Directory) data did not match the information recorded for this disk; it is not that the disk header was simply corrupt.
In fact, we read the backup copy of the disk header and confirmed it is fully consistent with the current one, so this has to be called a logical error.
And for a logical error of this kind, my personal feeling is that a bug is the more likely explanation.
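For reference, this is how the header backup can be compared against the active header; a minimal sketch, assuming the default 1MB AU size with 4KB metadata blocks, which puts the backup copy at aun=1 blkn=254:

# active disk header (AU 0, block 0)
kfed read /dev/raw/raw23 aun=0 blkn=0 > /tmp/hdr_active.txt

# backup disk header; blkn=254 assumes 1MB AUs with 4KB metadata blocks
kfed read /dev/raw/raw23 aun=1 blkn=254 > /tmp/hdr_backup.txt

# no output here means the active header matches its backup
diff /tmp/hdr_active.txt /tmp/hdr_backup.txt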
The final recommendation, naturally, was to drop the affected disk from the disk group by force and add it back, along these lines:
alter diskgroup ocrvdisk check norepair;
alter diskgroup ocrvdisk drop disk OCRVDISK_03 force;
alter diskgroup ocrvdisk add disk '/dev/raw/raw23';
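Once the disk is added back, the resulting rebalance can be watched until it finishes and the disk returns to a healthy state; a minimal check using the standard ASM views (group 3 is OCRVDISK in this environment):

-- any in-flight rebalance triggered by the drop/add shows up here
select group_number, operation, state, power, est_minutes
  from v$asm_operation;

-- the re-added disk should come back as CACHED / MEMBER
select disk_number, name, path, mount_status, header_status, mode_status
  from v$asm_disk
 where group_number = 3;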
Note: Oracle RAC 11.2.0.4 on Linux. For your reference.