"IPC send timeout error" Causes RAC Node Hang
Unless marked as a reprint, articles on this site are original. Reprinted from love wife love life — Roger's Oracle/MySQL/PostgreSQL data recovery blog
A colleague of mine manages a 3-node RAC that ran into trouble last week: an IPC send timeout on one of the nodes caused the other two nodes to hang, the application could no longer connect, and in the end the nodes had to be restarted. Here is a simple timeline pieced together from the alert logs:
###### sndb3 ######
Fri Nov 25 09:26:08 2011
IPC Send timeout detected. Receiver ospid 340396 -- sndb3
..........
Fri Nov 25 09:28:54 2011
IPC Send timeout detected. Receiver ospid 340396 -- sndb3
Fri Nov 25 10:04:19 2011
Trace dumping is performing id=[cdmp_20111125100426]
Fri Nov 25 10:10:07 2011
Warning: cluster_database_instances (2) of node 0 is <= my node id (2)
Fri Nov 25 10:12:20 2011
Reconfiguration started               -- 1st reconfig
List of nodes: 0 2
Fri Nov 25 10:28:05 2011
Reconfiguration started               -- 2nd reconfig
List of nodes: 0 1 2
Fri Nov 25 10:49:03 2011
Reconfiguration started               -- 3rd reconfig
List of nodes: 2
Fri Nov 25 10:49:50 2011
Reconfiguration started               -- 4th reconfig
List of nodes: 1 2
Fri Nov 25 10:55:17 2011
Reconfiguration started               -- 5th reconfig
List of nodes: 2
Fri Nov 25 10:55:18 2011
Reconfiguration complete
Fri Nov 25 10:55:19 2011
Shutting down instance (abort)        -- instance aborted
Fri Nov 25 10:56:47 2011
Starting ORACLE instance              -- instance restarted normally
###### sndb1 ######
Fri Nov 25 09:26:29 2011
IPC Send timeout detected.Sender: ospid 352952 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:26:32 2011
ORA-27506: IPC error connecting to a port
Fri Nov 25 09:26:33 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 09:26:38 2011
IPC Send timeout detected.Sender: ospid 258258 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:27:05 2011
ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 09:42:09 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03135: connection lost contact
ORA-03135: connection lost contact
Fri Nov 25 10:12:41 2011
Reconfiguration started               -- 1st reconfig
List of nodes: 0 2
Fri Nov 25 10:22:55 2011
IPC Send timeout detected.Sender: ospid 127746 Receiver: inst 3 binc -981855203 ospid 340396
.............
.............
Fri Nov 25 10:22:58 2011
IPC Send timeout detected.Sender: ospid 299594 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 10:22:58 2011
ORA-27506: IPC error connecting to a port
Fri Nov 25 10:22:59 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
.............
Fri Nov 25 10:23:25 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:28:26 2011
Reconfiguration started               -- 2nd reconfig
List of nodes: 0 1 2
Fri Nov 25 10:33:55 2011
Reconfiguration complete
Fri Nov 25 10:39:14 2011
IPC Send timeout detected.Sender: ospid 377580 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 10:39:18 2011
ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:49:19 2011
ORA-29746: Cluster Synchronization Service is being shut down.
Fri Nov 25 10:49:20 2011
ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
ORA-29746: Cluster Synchronization Service is being shut down.
Instance terminated by LMON, pid = 365348
Fri Nov 25 11:02:12 2011
Starting ORACLE instance (normal)     -- instance sndb1 restarted normally
###### sndb2 ######
Fri Nov 25 09:26:20 2011
Trace dumping is performing id=[cdmp_20111125092634]
Fri Nov 25 09:26:24 2011
IPC Send timeout detected.Sender: ospid 352604 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:26:50 2011
ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:04:24 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:04:25 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03113: end-of-file on communication channel
Fri Nov 25 10:10:13 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03135: connection lost contact
ORA-03135: connection lost contact
Fri Nov 25 10:12:23 2011
Shutting down instance (abort)        -- 1st abort
Fri Nov 25 10:12:23 2011
ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
Fri Nov 25 10:12:58 2011
Instance terminated by USER, pid = 304124
Fri Nov 25 10:27:52 2011
Starting ORACLE instance (normal)     -- 1st normal restart
Fri Nov 25 10:28:10 2011
lmon registered with NM - instance id 2 (internal mem no 1)
Fri Nov 25 10:28:12 2011
Reconfiguration started               -- 1st reconfig
List of nodes: 0 1 2
Fri Nov 25 10:33:23 2011
IPC Send timeout detected.Sender: ospid 529138
Fri Nov 25 10:33:26 2011
ORA-27506: IPC error connecting to a port
Fri Nov 25 10:33:34 2011
ORA-00482: LMD* process terminated with error
Fri Nov 25 10:33:36 2011
Shutting down instance (abort)        -- 2nd abort
...............
...............
Fri Nov 25 10:34:47 2011
ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
Fri Nov 25 10:34:53 2011
Instance terminated by USER, pid = 446862
Fri Nov 25 10:49:39 2011
Starting ORACLE instance (normal)     -- 2nd normal restart
Fri Nov 25 10:49:57 2011
Reconfiguration started               -- 2nd reconfig
List of nodes: 1 2
Fri Nov 25 10:56:39 2011
Instance terminated by USER, pid = 205414
Fri Nov 25 11:03:59 2011
Starting ORACLE instance (normal)     -- 3rd normal restart
Fri Nov 25 11:04:17 2011
Reconfiguration started               -- 3rd reconfig
List of nodes: 0 1 2
That is the overall picture. According to my colleague, between 09:00 and 10:00 on the 25th the IPC send timeout errors caused nodes 1 and 2 to hang — his exact description was that applications could no longer connect to either node — and he finally stopped both nodes forcibly with srvctl. An IPC send timeout severe enough to hang the system usually comes down to one of the following causes:

1. Very high CPU load
2. Insufficient memory
3. Network problems, e.g. an abnormal interconnect (heartbeat)
4. OS parameters set too small, e.g. the UDP/TCP buffer parameters
5. A CRS/DB version mismatch (I am not sure about this one — I added it tentatively because here CRS is 10.2.0.4 while the DB is 10.2.0.1)

Let's work through these one by one. The osw (OSWatcher) data on node sndb3 around the time of the failure looks like this:
zzz ***Fri Nov 25 09:25:46 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 0  0 4002889 11233515  0  0  0  0  0  0 6238 240716  20451  11  6 81  2
 2  0 4002865 11233537  0  0  0  0  0  0 6215 240551  21467  10  6 82  2
 2  0 4002895 11233502  0  0  0  0  0  0 5594 239027  19584  11  5 82  2

zzz ***Fri Nov 25 09:26:16 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 0  1 4002579 11229653  0  0  0  0  0  0 1950  31763   8699   4  2 88  5
 0  1 4002657 11229073  0  0  0  0  0  0 2325  27718   9141   3  2 90  5
 2  2 4002661 11228588  0  0  0  0  0  0 1891  22869   8332   4  1 90  5

zzz ***Fri Nov 25 09:26:46 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 1  0 4009439 11223429  0  0  0  0  0  0  820  21974   4532   4  2 95  0
 1  0 4009445 11223419  0  0  0  0  0  0  797  13064   4497   3  1 96  0
 1  0 4009445 11223416  0  0  0  0  0  0  621  11103   3988   3  1 96  0

zzz ***Fri Nov 25 09:27:16 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 3  1 4012753 11217337  0  0  0  0  0  0  303 229232   3042   6 20 73  0
 2  0 4012780 11216595  0  0  0  0  0  0  612 107627   3761   5  9 86  0
 2  0 4012780 11216592  0  0  0  0  0  0  770  12774   4312   3  1 96  0
As the output shows, when the IPC send timeouts occurred, sndb3 had plenty of idle CPU and free memory. So next, check the network:
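As a quick numeric check, the vmstat rows above can be reduced to an average idle percentage. A minimal Python sketch (the column layout follows the AIX vmstat output above; the helper itself is mine, not part of the original analysis):

```python
# AIX vmstat columns: r b avm fre re pi po fr sr cy in sy cs us sy id wa
# The sample rows below are copied from the 09:25:46 osw snapshot above.
rows = """\
0 0 4002889 11233515 0 0 0 0 0 0 6238 240716 20451 11 6 81 2
2 0 4002865 11233537 0 0 0 0 0 0 6215 240551 21467 10 6 82 2
2 0 4002895 11233502 0 0 0 0 0 0 5594 239027 19584 11 5 82 2
""".strip().splitlines()

def avg_idle(lines):
    """Average of the 'id' (CPU idle %) column, the 16th field."""
    idles = [int(line.split()[15]) for line in lines]
    return sum(idles) / len(idles)

print(round(avg_idle(rows), 1))  # -> 81.7, i.e. the CPU is mostly idle
```

Roughly 82% idle across the snapshot, which is why CPU load can be ruled out here.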
zzz ***Fri Nov 25 09:25:46 EAT 2011
Name  Mtu   Network    Address            Ipkts      Ierrs  Opkts      Oerrs    Coll
en3   1500  link#2     0.21.5e.6.8.37     56004124   0      948458     10       0
                       01:00:5e:00:00:01
en3   1500  192.168.0  192.168.0.3        56004124   0      948458     10       0
                       224.0.0.1
en0   1500  link#3     0.21.5e.6.8.f2     623180180  0      574030148  3        0
                       01:00:5e:00:00:01
en0   1500  10.10.10   10.10.10.12        623180180  0      574030148  3        0
                       224.0.0.1
en1   1500  link#4     0.21.5e.6.8.f3     0          0      1401015    1401015  0
                       01:00:5e:00:00:01
en1   1500  10.76.200  10.76.200.132      0          0      1401015    1401015  0
                       224.0.0.1
en2   1500  link#5     0.21.5e.6.8.36     435553601  0      119387543  3        0
                       01:00:5e:00:00:01
en2   1500  10.76.1    10.76.1.143        435553601  0      119387543  3        0
                       224.0.0.1
en2   1500  10.76.1    10.76.1.144        435553601  0      119387543  3        0
lo0   16896 link#1                        14167657   0      14181993   0        0
lo0   16896 127        127.0.0.1          14167657   0      14181993   0        0
                       224.0.0.1
lo0   16896 ::1                           14167657   0      14181993   0        0
                       ff01::1
                       ff02::1:ff00:1
                       ff02::1
icmp:
        95502471 calls to icmp_error
        0 errors not generated because old message was icmp
        Output histogram:
                echo reply: 4295876
                destination unreachable: 226000
        872 messages with bad code fields
        0 messages < minimum length
        12 bad checksums
        0 messages with bad length
        Input histogram:
                echo reply: 1458209
                destination unreachable: 56666
                echo: 4295876
                time exceeded: 191
        4295876 message responses generated
igmp:
        0 messages received
        0 messages received with too few bytes
        0 messages received with bad checksum
        0 membership queries received
        0 membership queries received with invalid field(s)
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        8 membership reports sent
tcp:
        122743924 packets sent
                116431735 data packets (2302053184 bytes)
                25761 data packets (10003811 bytes) retransmitted
                3553016 ack-only packets (3361721 delayed)
                0 URG only packets
                13 window probe packets
                2423518 window update packets
                309881 control packets
                28566873 large sends
                1403066833 bytes sent using largesend
                65160 bytes is the biggest largesend
        386118499 packets received
                369946371 acks (for 2302274662 bytes)
                74335 duplicate acks
                0 acks for unsent data
                98730125 packets (1777353999 bytes) received in-sequence
                23016 completely duplicate packets (3194142 bytes)
                0 old duplicate packets
                20 packets with some dup. data (8594 bytes duped)
                35986 out-of-order packets (2602936 bytes)
                0 packets (0 bytes) of data after window
                0 window probes
                49947 window update packets
                7 packets received after close
                0 packets with bad hardware assisted checksum
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                13 discarded by listeners
                0 discarded due to listener's queue full
                283911590 ack packet headers correctly predicted
                15339170 data packet headers correctly predicted
        59734 connection requests
        59629 connection accepts
        83810 connections established (including accepts)
        119948 connections closed (including 1831 drops)
        0 connections with ECN capability
        0 times responded to ECN
        35446 embryonic connections dropped
        351698548 segments updated rtt (of 112846079 attempts)
        0 segments with congestion window reduced bit set
        0 segments with congestion experienced bit set
        0 resends due to path MTU discovery
        36306 path MTU discovery terminations due to retransmits
        203836 retransmit timeouts
                439 connections dropped by rexmit timeout
        189 fast retransmits
                26 when congestion window less than 4 segments
        134 newreno retransmits
        27 times avoided false fast retransmits
        30 persist timeouts
                0 connections dropped due to persist timeout
        46287 keepalive timeouts
                9999 keepalive probes sent
                36288 connections dropped by keepalive
        0 times SACK blocks array is extended
        0 times SACK holes array is extended
        0 packets dropped due to memory allocation failure
        0 connections in timewait reused
        0 delayed ACKs for SYN
        0 delayed ACKs for FIN
        0 send_and_disconnects
        0 spliced connections
        0 spliced connections closed
        0 spliced connections reset
        0 spliced connections timeout
        0 spliced connections persist timeout
        0 spliced connections keepalive timeout
udp:
        702526878 datagrams received
        0 incomplete headers
        3 bad data length fields
        5 bad checksums
        95502471 dropped due to no socket
        47780240 broadcast/multicast datagrams dropped due to no socket
        100987 socket buffer overflows        <== Note here.
        559143172 delivered
        559613903 datagrams output
ip:
        1106726708 total packets received
        0 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
        0 with bad options
        0 with incorrect version number
        13553929 fragments received
        9 fragments dropped (dup or out of space)
        841 fragments dropped after timeout
        3122140 packets reassembled ok
        1094399939 packets for this host
        56393 packets for unknown/unsupported protocol
        0 packets forwarded
        207190 packets not forwardable
        0 redirects sent
        693606490 packets sent from this host
        991 packets sent with fabricated ip header
        0 output packets dropped due to no bufs, etc.
        0 output packets discarded due to no route
        4245736 output datagrams fragmented
        20081318 fragments created
        396 datagrams that can't be fragmented
        1631395 IP Multicast packets dropped due to no receiver
        0 successful path MTU discovery cycles
        0 path MTU rediscovery cycles attempted
        0 path MTU discovery no-response estimates
        0 path MTU discovery response timeouts
        0 path MTU discovery decreases detected
        0 path MTU discovery packets sent
        0 path MTU discovery memory allocation failures
        0 ipintrq overflows
        2 with illegal source
        0 packets processed by threads
        0 packets dropped by threads
        14925 packets dropped due to the full socket receive buffer
        0 dead gateway detection packets sent
        0 dead gateway detection packet allocation failures
        0 dead gateway detection gateway allocation failures
ipv6:
        0 total packets received
        0 with size smaller than minimum
        0 with data size < data length
        0 with incorrect version number
        0 with illegal source
        0 input packets without enough memory
        0 fragments received
        0 fragments dropped (dup or out of space)
        0 fragments dropped after timeout
        0 packets reassembled ok
        0 packets for this host
        0 packets for unknown/unsupported protocol
        0 packets forwarded
        0 packets not forwardable
        0 too big packets not forwarded
        0 packets sent from this host
        0 packets sent with fabricated ipv6 header
        0 output packets dropped due to no bufs, etc.
        0 output packets without enough memory
        0 output packets discarded due to no route
        0 output datagrams fragmented
        0 fragments created
        0 packets dropped due to the full socket receive buffer
        0 packets not delivered due to bad raw IPv6 checksum
icmpv6:
        0 calls to icmp6_error
        0 errors not generated because old message was icmpv6
        Output histogram:
                unreachable: 0
                packets too big: 0
                time exceeded: 0
                parameter problems: 0
                redirects: 0
                echo requests: 0
                echo replies: 0
                group queries: 0
                group reports: 0
                group terminations: 0
                router solicitations: 0
                router advertisements: 0
                neighbor solicitations: 0
                neighbor advertisements: 0
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        Input histogram:
                unreachable: 0
                packets too big: 0
                time exceeded: 0
                parameter problems: 0
                echo requests: 0
                echo replies: 0
                group queries: 0
                bad group queries: 0
                group reports: 0
                bad group reports: 0
                our groups' reports: 0
                group terminations: 0
                bad group terminations: 0
                router solicitations: 0
                bad router solicitations: 0
                router advertisements: 0
                bad router advertisements: 0
                neighbor solicitations: 0
                bad neighbor solicitations: 0
                neighbor advertisements: 0
                bad neighbor advertisements: 0
                redirects: 0
                bad redirects: 0
                mobility calls when not started: 0
                home agent address discovery requests: 0
                bad home agent address discovery requests: 0
                home agent address discovery replies: 0
                bad home agent address discovery replies: 0
                prefix solicitations: 0
                bad prefix solicitations: 0
                prefix advertisements: 0
                bad prefix advertisements: 0
        0 message responses generated
The "100987 socket buffer overflows" counter above shows that UDP datagrams were being dropped for lack of socket buffer space, which suggests the UDP kernel parameters may be set too small. Next, look at the CRS logs and the traces of the key background processes such as LMD and LMON:
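Since the key signal is a single counter buried in the `netstat -s` output, it is easy to extract programmatically when checking whether it keeps growing between snapshots. A small Python sketch (the parser and the truncated sample text are illustrative; the counter value is the one from the dump above):

```python
def udp_socket_buffer_overflows(netstat_output: str) -> int:
    """Extract the 'socket buffer overflows' counter from `netstat -s` udp stats."""
    for line in netstat_output.splitlines():
        line = line.strip()
        if line.endswith("socket buffer overflows"):
            return int(line.split()[0])
    raise ValueError("counter not found")

# Truncated sample of the udp section from the dump above:
sample = """\
udp:
        702526878 datagrams received
        100987 socket buffer overflows
        559143172 delivered
"""
print(udp_socket_buffer_overflows(sample))  # -> 100987
```

A single large absolute value is not conclusive by itself; what matters is whether the counter increases during the problem window.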
*** 2011-11-25 09:26:38.408
IPC Send timeout detected.Sender: ospid 258258 Receiver: inst 3 binc -981855203 ospid 340396
SKGXPCTX: 0x11053d158 ctx
wait delta 0 sec (14 msec) ctx ts 0x4b421 last ts 0x4b422
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1 blocked 1 timed wait receives 0
admno 0x3c67bf54 admport:
SSKGXPT 0x1053d4a4 flags active network 0
info for network 0
        socket no 11 IP 10.10.10.10 UDP 36952
        HACMP network_id 0
sflags SSKGXPT_UP
context timestamp 0x4b421
        buffers queued on port 11057a888
 sconno accono ertt state seq# sent async sync rtrans acks
 ach         accono     sconno     admno      state seq#  rcv rtrans acks
 0x1105d9050 0x2c61e946 0x48f33837 0x1368ad14 40    32764 1   0      1
Submitting synchronized dump request [268435460]
*** 2011-11-25 09:27:05.144
ORA-27506: IPC error connecting to a port
Assertion failed: 0, file skgxpu.c, line 8568
*** 2011-11-25 09:27:05.144
NOTICE: Process received signal SIGLOST. Therefore, the call stack
does not point to the location of the problem. Take a look at
/var/adm/messages to find why this process got SIGLOST
Exception signal: 6
0x9000000002832b4 (pthread_kill+0080) e8410028
Registers:

#if defined (SIGLOST)      /* resource lost (eg, record-lock lost) */
    signal_names[SIGLOST] = "SIGLOST";
#endif
###### lmon trace on sndb3 ######
*** 2011-11-25 10:12:20.202
kjfcrfg: DRM window size = 2048->2048 (min lognb = 14)
*** 2011-11-25 10:12:20.210
Reconfiguration started (old inc 6, new inc 8)
Synchronization timeout interval: 1020 sec
List of nodes: 0 2
*** 2011-11-25 10:12:20.213
 Global Resource Directory frozen
 node 0
 node 2
 asby init, 0/0/x2
 asby returns, 0/0/x2/false
* Domain maps before reconfiguration:
*   DOMAIN 0 (valid 1): 0 1 2
* End of domain mappings
* kjbdomrcfg2: domain 0 invalid = TRUE
* Domain maps after recomputation:
*   DOMAIN 0 (valid 0): 0 2
* End of domain mappings
 Dead inst 1
 Join inst
 Exist inst 0 2
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 sent syncr inc 8 lvl 1 to 0 (8,5/0/0)
 synca inc 8 lvl 1 rcvd (8.5)
 replayed dom info (8.6)
 received master 0 (8.6)
 sent syncr inc 8 lvl 2 to 0 (8,7/0/0)
 synca inc 8 lvl 2 rcvd (8.7)
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Set master node info
 sent syncr inc 8 lvl 3 to 0 (8,13/0/0)
 synca inc 8 lvl 3 rcvd (8.13)
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 2204
 sent syncr inc 8 lvl 4 to 0 (8,15/0/0)
 synca inc 8 lvl 4 rcvd (8.15)
 Dwn-cvts replayed, VALBLKs dubious
 sent syncr inc 8 lvl 5 to 0 (8,18/0/0)
 synca inc 8 lvl 5 rcvd (8.18)
 All grantable enqueues granted
 sent syncr inc 8 lvl 6 to 0 (8,20/0/0)
 synca inc 8 lvl 6 rcvd (8.20)
*** 2011-11-25 10:12:22.061
 Submitted all GCS cache requests
 sent syncr inc 8 lvl 7 to 0 (8,22/0/0)
 synca inc 8 lvl 7 rcvd (8.22)
 Fix write in gcs resources
 sent syncr inc 8 lvl 8 to 0 (8,24/0/0)
 synca inc 8 lvl 8 rcvd (8.24)
*** 2011-11-25 10:12:23.459
The crsd and ocssd logs and the lmon/lmd traces on the other two nodes contained nothing of value either — in fact there is no output at all between 09:26 and 10:12. The only somewhat interesting entries are in the lmd trace:
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11 UDP port number 61800
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11 UDP port number 61800
*** 2011-11-25 09:26:20.449
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10 UDP port number 36992
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10 UDP port number 33548
*** 2011-11-25 09:27:15.598
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 275164, image:
*** 2011-11-25 09:27:15.598
In addition, `oradebug ipc` confirms that the RAC interconnect is indeed running over UDP:
Shared Memory:
ID         KEY
10485768   0xe1892820
Maximum processes: = 750
ksxpdmp: facility 0 (?) (0x1, 0x0) counts 0, 0
ksxpdmp: Dumping the osd context
SKGXPCTX: 0x1105a4c78 ctx
wait delta 13 sec (13149 msec) ctx ts 0xb last ts 0xb
user cpu time since last wait -171798692 sec -1 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1 blocked 0 timed wait receives 0
admno 0x104b9dec admport:
SSKGXPT 0x105a4fc4 flags SSKGXPT_READPENDING active network 0
info for network 0
        socket no 7 IP 10.10.10.12 UDP 55895        <== Note here.
        HACMP network_id 0
sflags SSKGXPT_UP
context timestamp 0xb
no ports
 sconno     accono     ertt state seq#  sent async sync rtrans acks
 0x3115c6c7 0x6e0d721f 32   3     32765 2    2     0    0      2
Nothing of value turned up on the OS side either. The AWR Top 5 during the failure window is shown below:
<div id="B" style="height:auto; width:auto;OVERFLOW-y:auto;border:red 1px solid;margin:10px">
<img src="http://www.killdb.com/wp-content/uploads/2011/11/IPC_send_timeout_error_top_5.jpg" alt="" title="IPC_send_timeout_error_top_5" width="500" height="162" class="aligncenter size-full wp-image-854" />
</div>
The two ges/gcs "enter server mode" wait events here are normal: the instance was performing recovery, and this is the event SMON waits on while doing so. Here is an explanation found via Google:
In a RAC instance, during instance recovery, the first-pass scan can be delayed by 300 ms to 1.5 s waiting on the GRD (Global Resource Directory). During that time SMON enters the wait event called "ges enter server mode".
Another point worth noting is that the SQ-related waits are relatively high. The "Global Cache and Enqueue Services - Messaging Statistics" section likewise shows that cluster interconnect traffic is quite heavy. There is a related note on MOS, although it does not appear to match this situation:
<span style="font-size: 12px; color: #0000ff; font-family: monospace;"> IPC Send timeout detected and Instance eviction on RDBMS and ASM 10.2.0.1 RAC [ID 370182.1] </span>
My conclusion, sent back by email, was as follows. Setting the UDP parameters too small can cause a RAC node to hang, so of the causes listed earlier I consider this one — undersized UDP buffer parameters — the most likely. The current values on this system are:

udp_recvspace = 655360
udp_sendspace = 65536

Per the official documentation, the recommended values are:

udp_recvspace = 2621440 (this one can be left as is)
udp_sendspace = 262144

The officially recommended minimum for udp_sendspace is (DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4 KB. After changing the parameters, monitor the system for a while.

Addendum: the AWR Top 5 also shows fairly high "enq: SQ - contention" waits, so I recommended increasing the sequence cache values as well (the default of 20 is small).
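The sizing rule quoted above is simple arithmetic. A quick sketch (the block size and multiblock read count below are illustrative values, not taken from this system):

```python
def min_udp_sendspace(db_block_size: int, db_file_multiblock_read_count: int) -> int:
    """Minimum udp_sendspace per the quoted rule:
    (DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4 KB."""
    return db_block_size * db_file_multiblock_read_count + 4 * 1024

# e.g. 8 KB blocks with a multiblock read count of 16:
print(min_udp_sendspace(8192, 16))  # -> 135168
```

Even with these modest example settings, the minimum (135168) is already above the system's old udp_sendspace of 65536, which is consistent with the parameter being too small.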
The sequences in question are under TOYS_SEQ_ZF; they were modified as follows:

alter sequence owner.sequence_name cache 1000;

To check the sequence cache values:

set lines 200
set pagesize 1000
col owner for a20
col SEQUENCE_NAME for a30
select owner, sequence_name, max_value, increment_by, cycle_flag, order_flag, cache_size
  from dba_sequences;
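When many sequences need the same cache change, the ALTER statements can be generated from the query output instead of being typed one by one. A hypothetical helper (the owner/sequence names below are made up; real rows would come from the dba_sequences query above):

```python
def alter_cache_statements(rows, cache=1000):
    """rows: iterable of (owner, sequence_name) pairs, e.g. from dba_sequences."""
    return [f"alter sequence {owner}.{seq} cache {cache};" for owner, seq in rows]

# Made-up example rows:
for stmt in alter_cache_statements([("TOYS", "SEQ_ORDER_ID"), ("TOYS", "SEQ_ITEM_ID")]):
    print(stmt)
```

The generated script can then be reviewed and run in SQL*Plus.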
4 Responses to ""IPC send timeout error" Causes RAC Node Hang"
Actually, on AIX there is one more possibility: a missing OS fix can itself cause IPC send timeouts — see AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]
Postscript:
After the first round of UDP parameter changes, the problem recurred the next day, so I traveled to the customer site in Xi'an myself. The system has been stable since the following adjustments:
1. Adjust db_cache_size and shared_pool_size;
2. Increase the cache value of some sequences;
3. Raise the root user's resource limits;
4. Disable DRM;
5. Restart the database instances and CRS.