"IPC send timeout error" Causes RAC Node Hang
Unless marked as a reprint, articles on this site are original. Reprinted from love wife love life — Roger's Oracle/MySQL/PostgreSQL data recovery blog
A colleague of mine manages a 3-node RAC that ran into trouble last week: an IPC send timeout on one of the nodes caused the other two nodes to hang, the application could no longer connect, and in the end the nodes had to be restarted. Here is a simple timeline pieced together from the alert logs:
###### sndb3 ######
Fri Nov 25 09:26:08 2011
IPC Send timeout detected. Receiver ospid 340396 -- sndb3
..........
Fri Nov 25 09:28:54 2011
IPC Send timeout detected. Receiver ospid 340396 -- sndb3
Fri Nov 25 10:04:19 2011
Trace dumping is performing id=[cdmp_20111125100426]
Fri Nov 25 10:10:07 2011
Warning: cluster_database_instances (2) of node 0 is <= my node id (2)
Fri Nov 25 10:12:20 2011
Reconfiguration started               -- 1st reconfig
List of nodes: 0 2
Fri Nov 25 10:28:05 2011
Reconfiguration started               -- 2nd reconfig
List of nodes: 0 1 2
Fri Nov 25 10:49:03 2011
Reconfiguration started               -- 3rd reconfig
List of nodes: 2
Fri Nov 25 10:49:50 2011
Reconfiguration started               -- 4th reconfig
List of nodes: 1 2
Fri Nov 25 10:55:17 2011
Reconfiguration started               -- 5th reconfig
List of nodes: 2
Fri Nov 25 10:55:18 2011
Reconfiguration complete
Fri Nov 25 10:55:19 2011
Shutting down instance (abort)        -- instance aborted
Fri Nov 25 10:56:47 2011
Starting ORACLE instance              -- instance restarted normally
###### sndb1 ######
Fri Nov 25 09:26:29 2011
IPC Send timeout detected.Sender: ospid 352952 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:26:32 2011
ORA-27506: IPC error connecting to a port
Fri Nov 25 09:26:33 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 09:26:38 2011
IPC Send timeout detected.Sender: ospid 258258 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:27:05 2011
ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 09:42:09 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03135: connection lost contact
ORA-03135: connection lost contact
Fri Nov 25 10:12:41 2011
Reconfiguration started               -- 1st reconfig
List of nodes: 0 2
Fri Nov 25 10:22:55 2011
IPC Send timeout detected.Sender: ospid 127746 Receiver: inst 3 binc -981855203 ospid 340396
.............
.............
Fri Nov 25 10:22:58 2011
IPC Send timeout detected.Sender: ospid 299594 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 10:22:58 2011
ORA-27506: IPC error connecting to a port
Fri Nov 25 10:22:59 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
.............
Fri Nov 25 10:23:25 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:28:26 2011
Reconfiguration started               -- 2nd reconfig
List of nodes: 0 1 2
Fri Nov 25 10:33:55 2011
Reconfiguration complete
Fri Nov 25 10:39:14 2011
IPC Send timeout detected.Sender: ospid 377580 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 10:39:18 2011
ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:49:19 2011
ORA-29746: Cluster Synchronization Service is being shut down.
Fri Nov 25 10:49:20 2011
ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
ORA-29746: Cluster Synchronization Service is being shut down.
Instance terminated by LMON, pid = 365348
Fri Nov 25 11:02:12 2011
Starting ORACLE instance (normal)     -- instance sndb1 restarted normally
###### sndb2 ######
Fri Nov 25 09:26:20 2011
Trace dumping is performing id=[cdmp_20111125092634]
Fri Nov 25 09:26:24 2011
IPC Send timeout detected.Sender: ospid 352604 Receiver: inst 3 binc -981855203 ospid 340396
Fri Nov 25 09:26:50 2011
ORA-27506: IPC error connecting to a port
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:04:24 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
Fri Nov 25 10:04:25 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03113: end-of-file on communication channel
Fri Nov 25 10:10:13 2011
ORA-07445: exception encountered: core dump [] [] [] [] [] []
ORA-03135: connection lost contact
ORA-03135: connection lost contact
Fri Nov 25 10:12:23 2011
Shutting down instance (abort)        -- 1st abort
Fri Nov 25 10:12:23 2011
ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
Fri Nov 25 10:12:58 2011
Instance terminated by USER, pid = 304124
Fri Nov 25 10:27:52 2011
Starting ORACLE instance (normal)     -- 1st normal restart
Fri Nov 25 10:28:10 2011
lmon registered with NM - instance id 2 (internal mem no 1)
Fri Nov 25 10:28:12 2011
Reconfiguration started               -- 1st reconfig
List of nodes: 0 1 2
Fri Nov 25 10:33:23 2011
IPC Send timeout detected.Sender: ospid 529138
Fri Nov 25 10:33:26 2011
ORA-27506: IPC error connecting to a port
Fri Nov 25 10:33:34 2011
ORA-00482: LMD* process terminated with error
Fri Nov 25 10:33:36 2011
Shutting down instance (abort)        -- 2nd abort
...............
...............
Fri Nov 25 10:34:47 2011
ORA-27300: OS system dependent operation:invalid_process failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpstime3
ORA-27144: attempt to kill process failed
Fri Nov 25 10:34:53 2011
Instance terminated by USER, pid = 446862
Fri Nov 25 10:49:39 2011
Starting ORACLE instance (normal)     -- 2nd normal restart
Fri Nov 25 10:49:57 2011
Reconfiguration started               -- 2nd reconfig
List of nodes: 1 2
Fri Nov 25 10:56:39 2011
Instance terminated by USER, pid = 205414
Fri Nov 25 11:03:59 2011
Starting ORACLE instance (normal)     -- 3rd normal restart
Fri Nov 25 11:04:17 2011
Reconfiguration started               -- 3rd reconfig
List of nodes: 0 1 2
That is the overall picture. According to my colleague, between 09:00 and 10:00 on the 25th the IPC send timeout errors caused nodes 1 and 2 to hang — his exact description was that applications could no longer connect to either node — and he finally stopped both nodes forcibly with srvctl. An IPC send timeout severe enough to hang the system usually comes down to one of the following causes:

1. Very high CPU load
2. Insufficient memory
3. Network problems, e.g. an abnormal interconnect (heartbeat)
4. OS parameters set too small, e.g. the UDP/TCP buffer parameters
5. A CRS/DB version mismatch (I am not sure about this one — I added it tentatively because here CRS is 10.2.0.4 while the DB is 10.2.0.1)

Let's work through these one by one. The osw (OSWatcher) data on node sndb3 around the time of the failure looks like this:
zzz ***Fri Nov 25 09:25:46 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 0  0 4002889 11233515  0  0  0  0  0  0 6238 240716  20451  11  6 81  2
 2  0 4002865 11233537  0  0  0  0  0  0 6215 240551  21467  10  6 82  2
 2  0 4002895 11233502  0  0  0  0  0  0 5594 239027  19584  11  5 82  2

zzz ***Fri Nov 25 09:26:16 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 0  1 4002579 11229653  0  0  0  0  0  0 1950  31763   8699   4  2 88  5
 0  1 4002657 11229073  0  0  0  0  0  0 2325  27718   9141   3  2 90  5
 2  2 4002661 11228588  0  0  0  0  0  0 1891  22869   8332   4  1 90  5

zzz ***Fri Nov 25 09:26:46 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 1  0 4009439 11223429  0  0  0  0  0  0  820  21974   4532   4  2 95  0
 1  0 4009445 11223419  0  0  0  0  0  0  797  13064   4497   3  1 96  0
 1  0 4009445 11223416  0  0  0  0  0  0  621  11103   3988   3  1 96  0

zzz ***Fri Nov 25 09:27:16 EAT 2011
System configuration: lcpu=16 mem=63936MB
kthr    memory              page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm      fre    re pi po fr sr cy  in    sy     cs   us sy id wa
 3  1 4012753 11217337  0  0  0  0  0  0  303 229232   3042   6 20 73  0
 2  0 4012780 11216595  0  0  0  0  0  0  612 107627   3761   5  9 86  0
 2  0 4012780 11216592  0  0  0  0  0  0  770  12774   4312   3  1 96  0
As the output shows, when the IPC send timeouts occurred, sndb3 had plenty of idle CPU and free memory. So next, check the network:
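As a quick numeric check, the vmstat rows above can be reduced to an average idle percentage. A minimal Python sketch (the column layout follows the AIX vmstat output above; the helper itself is mine, not part of the original analysis):

```python
# AIX vmstat columns: r b avm fre re pi po fr sr cy in sy cs us sy id wa
# The sample rows below are copied from the 09:25:46 osw snapshot above.
rows = """\
0 0 4002889 11233515 0 0 0 0 0 0 6238 240716 20451 11 6 81 2
2 0 4002865 11233537 0 0 0 0 0 0 6215 240551 21467 10 6 82 2
2 0 4002895 11233502 0 0 0 0 0 0 5594 239027 19584 11 5 82 2
""".strip().splitlines()

def avg_idle(lines):
    """Average of the 'id' (CPU idle %) column, the 16th field."""
    idles = [int(line.split()[15]) for line in lines]
    return sum(idles) / len(idles)

print(round(avg_idle(rows), 1))  # -> 81.7, i.e. the CPU is mostly idle
```

Roughly 82% idle across the snapshot, which is why CPU load can be ruled out here.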
zzz ***Fri Nov 25 09:25:46 EAT 2011
Name  Mtu   Network    Address            Ipkts      Ierrs  Opkts      Oerrs    Coll
en3   1500  link#2     0.21.5e.6.8.37     56004124   0      948458     10       0
                       01:00:5e:00:00:01
en3   1500  192.168.0  192.168.0.3        56004124   0      948458     10       0
                       224.0.0.1
en0   1500  link#3     0.21.5e.6.8.f2     623180180  0      574030148  3        0
                       01:00:5e:00:00:01
en0   1500  10.10.10   10.10.10.12        623180180  0      574030148  3        0
                       224.0.0.1
en1   1500  link#4     0.21.5e.6.8.f3     0          0      1401015    1401015  0
                       01:00:5e:00:00:01
en1   1500  10.76.200  10.76.200.132      0          0      1401015    1401015  0
                       224.0.0.1
en2   1500  link#5     0.21.5e.6.8.36     435553601  0      119387543  3        0
                       01:00:5e:00:00:01
en2   1500  10.76.1    10.76.1.143        435553601  0      119387543  3        0
                       224.0.0.1
en2   1500  10.76.1    10.76.1.144        435553601  0      119387543  3        0
lo0   16896 link#1                        14167657   0      14181993   0        0
lo0   16896 127        127.0.0.1          14167657   0      14181993   0        0
                       224.0.0.1
lo0   16896 ::1                           14167657   0      14181993   0        0
                       ff01::1
                       ff02::1:ff00:1
                       ff02::1
icmp:
        95502471 calls to icmp_error
        0 errors not generated because old message was icmp
        Output histogram:
                echo reply: 4295876
                destination unreachable: 226000
        872 messages with bad code fields
        0 messages < minimum length
        12 bad checksums
        0 messages with bad length
        Input histogram:
                echo reply: 1458209
                destination unreachable: 56666
                echo: 4295876
                time exceeded: 191
        4295876 message responses generated
igmp:
        0 messages received
        0 messages received with too few bytes
        0 messages received with bad checksum
        0 membership queries received
        0 membership queries received with invalid field(s)
        0 membership reports received
        0 membership reports received with invalid field(s)
        0 membership reports received for groups to which we belong
        8 membership reports sent
tcp:
        122743924 packets sent
                116431735 data packets (2302053184 bytes)
                25761 data packets (10003811 bytes) retransmitted
                3553016 ack-only packets (3361721 delayed)
                0 URG only packets
                13 window probe packets
                2423518 window update packets
                309881 control packets
                28566873 large sends
                1403066833 bytes sent using largesend
                65160 bytes is the biggest largesend
        386118499 packets received
                369946371 acks (for 2302274662 bytes)
                74335 duplicate acks
                0 acks for unsent data
                98730125 packets (1777353999 bytes) received in-sequence
                23016 completely duplicate packets (3194142 bytes)
                0 old duplicate packets
                20 packets with some dup. data (8594 bytes duped)
                35986 out-of-order packets (2602936 bytes)
                0 packets (0 bytes) of data after window
                0 window probes
                49947 window update packets
                7 packets received after close
                0 packets with bad hardware assisted checksum
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
                13 discarded by listeners
                0 discarded due to listener's queue full
                283911590 ack packet headers correctly predicted
                15339170 data packet headers correctly predicted
        59734 connection requests
        59629 connection accepts
        83810 connections established (including accepts)
        119948 connections closed (including 1831 drops)
        0 connections with ECN capability
        0 times responded to ECN
        35446 embryonic connections dropped
        351698548 segments updated rtt (of 112846079 attempts)
        0 segments with congestion window reduced bit set
        0 segments with congestion experienced bit set
        0 resends due to path MTU discovery
        36306 path MTU discovery terminations due to retransmits
        203836 retransmit timeouts
                439 connections dropped by rexmit timeout
        189 fast retransmits
                26 when congestion window less than 4 segments
        134 newreno retransmits
        27 times avoided false fast retransmits
        30 persist timeouts
                0 connections dropped due to persist timeout
        46287 keepalive timeouts
                9999 keepalive probes sent
                36288 connections dropped by keepalive
        0 times SACK blocks array is extended
        0 times SACK holes array is extended
        0 packets dropped due to memory allocation failure
        0 connections in timewait reused
        0 delayed ACKs for SYN
        0 delayed ACKs for FIN
        0 send_and_disconnects
        0 spliced connections
        0 spliced connections closed
        0 spliced connections reset
        0 spliced connections timeout
        0 spliced connections persist timeout
        0 spliced connections keepalive timeout
udp:
        702526878 datagrams received
        0 incomplete headers
        3 bad data length fields
        5 bad checksums
        95502471 dropped due to no socket
        47780240 broadcast/multicast datagrams dropped due to no socket
        100987 socket buffer overflows        <== Note here.
        559143172 delivered
        559613903 datagrams output
ip:
        1106726708 total packets received
        0 bad header checksums
        0 with size smaller than minimum
        0 with data size < data length
        0 with header length < data size
        0 with data length < header length
        0 with bad options
        0 with incorrect version number
        13553929 fragments received
        9 fragments dropped (dup or out of space)
        841 fragments dropped after timeout
        3122140 packets reassembled ok
        1094399939 packets for this host
        56393 packets for unknown/unsupported protocol
        0 packets forwarded
        207190 packets not forwardable
        0 redirects sent
        693606490 packets sent from this host
        991 packets sent with fabricated ip header
        0 output packets dropped due to no bufs, etc.
        0 output packets discarded due to no route
        4245736 output datagrams fragmented
        20081318 fragments created
        396 datagrams that can't be fragmented
        1631395 IP Multicast packets dropped due to no receiver
        0 successful path MTU discovery cycles
        0 path MTU rediscovery cycles attempted
        0 path MTU discovery no-response estimates
        0 path MTU discovery response timeouts
        0 path MTU discovery decreases detected
        0 path MTU discovery packets sent
        0 path MTU discovery memory allocation failures
        0 ipintrq overflows
        2 with illegal source
        0 packets processed by threads
        0 packets dropped by threads
        14925 packets dropped due to the full socket receive buffer
        0 dead gateway detection packets sent
        0 dead gateway detection packet allocation failures
        0 dead gateway detection gateway allocation failures
ipv6:
        0 total packets received
        0 with size smaller than minimum
        0 with data size < data length
        0 with incorrect version number
        0 with illegal source
        0 input packets without enough memory
        0 fragments received
        0 fragments dropped (dup or out of space)
        0 fragments dropped after timeout
        0 packets reassembled ok
        0 packets for this host
        0 packets for unknown/unsupported protocol
        0 packets forwarded
        0 packets not forwardable
        0 too big packets not forwarded
        0 packets sent from this host
        0 packets sent with fabricated ipv6 header
        0 output packets dropped due to no bufs, etc.
        0 output packets without enough memory
        0 output packets discarded due to no route
        0 output datagrams fragmented
        0 fragments created
        0 packets dropped due to the full socket receive buffer
        0 packets not delivered due to bad raw IPv6 checksum
icmpv6:
        0 calls to icmp6_error
        0 errors not generated because old message was icmpv6
        Output histogram:
                unreachable: 0
                packets too big: 0
                time exceeded: 0
                parameter problems: 0
                redirects: 0
                echo requests: 0
                echo replies: 0
                group queries: 0
                group reports: 0
                group terminations: 0
                router solicitations: 0
                router advertisements: 0
                neighbor solicitations: 0
                neighbor advertisements: 0
        0 messages with bad code fields
        0 messages < minimum length
        0 bad checksums
        0 messages with bad length
        Input histogram:
                unreachable: 0
                packets too big: 0
                time exceeded: 0
                parameter problems: 0
                echo requests: 0
                echo replies: 0
                group queries: 0
                bad group queries: 0
                group reports: 0
                bad group reports: 0
                our groups' reports: 0
                group terminations: 0
                bad group terminations: 0
                router solicitations: 0
                bad router solicitations: 0
                router advertisements: 0
                bad router advertisements: 0
                neighbor solicitations: 0
                bad neighbor solicitations: 0
                neighbor advertisements: 0
                bad neighbor advertisements: 0
                redirects: 0
                bad redirects: 0
                mobility calls when not started: 0
                home agent address discovery requests: 0
                bad home agent address discovery requests: 0
                home agent address discovery replies: 0
                bad home agent address discovery replies: 0
                prefix solicitations: 0
                bad prefix solicitations: 0
                prefix advertisements: 0
                bad prefix advertisements: 0
        0 message responses generated
The "100987 socket buffer overflows" counter above shows that UDP datagrams were being dropped for lack of socket buffer space, which suggests the UDP kernel parameters may be set too small. Next, look at the CRS logs and the traces of the key background processes such as LMD and LMON:
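Since the key signal is a single counter buried in the `netstat -s` output, it is easy to extract programmatically when checking whether it keeps growing between snapshots. A small Python sketch (the parser and the truncated sample text are illustrative; the counter value is the one from the dump above):

```python
def udp_socket_buffer_overflows(netstat_output: str) -> int:
    """Extract the 'socket buffer overflows' counter from `netstat -s` udp stats."""
    for line in netstat_output.splitlines():
        line = line.strip()
        if line.endswith("socket buffer overflows"):
            return int(line.split()[0])
    raise ValueError("counter not found")

# Truncated sample of the udp section from the dump above:
sample = """\
udp:
        702526878 datagrams received
        100987 socket buffer overflows
        559143172 delivered
"""
print(udp_socket_buffer_overflows(sample))  # -> 100987
```

A single large absolute value is not conclusive by itself; what matters is whether the counter increases during the problem window.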
*** 2011-11-25 09:26:38.408
IPC Send timeout detected.Sender: ospid 258258 Receiver: inst 3 binc -981855203 ospid 340396
SKGXPCTX: 0x11053d158 ctx
wait delta 0 sec (14 msec) ctx ts 0x4b421 last ts 0x4b422
user cpu time since last wait 0 sec 0 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1 blocked 1 timed wait receives 0
admno 0x3c67bf54 admport:
SSKGXPT 0x1053d4a4 flags active network 0
info for network 0
        socket no 11 IP 10.10.10.10 UDP 36952
        HACMP network_id 0
sflags SSKGXPT_UP
context timestamp 0x4b421
        buffers queued on port 11057a888
 sconno accono ertt state seq# sent async sync rtrans acks
 ach         accono     sconno     admno      state seq#  rcv rtrans acks
 0x1105d9050 0x2c61e946 0x48f33837 0x1368ad14 40    32764 1   0      1
Submitting synchronized dump request [268435460]
*** 2011-11-25 09:27:05.144
ORA-27506: IPC error connecting to a port
Assertion failed: 0, file skgxpu.c, line 8568
*** 2011-11-25 09:27:05.144
NOTICE: Process received signal SIGLOST. Therefore, the call stack
does not point to the location of the problem. Take a look at
/var/adm/messages to find why this process got SIGLOST
Exception signal: 6
0x9000000002832b4 (pthread_kill+0080) e8410028
Registers:

#if defined (SIGLOST)      /* resource lost (eg, record-lock lost) */
    signal_names[SIGLOST] = "SIGLOST";
#endif
###### lmon trace on sndb3 ######
*** 2011-11-25 10:12:20.202
kjfcrfg: DRM window size = 2048->2048 (min lognb = 14)
*** 2011-11-25 10:12:20.210
Reconfiguration started (old inc 6, new inc 8)
Synchronization timeout interval: 1020 sec
List of nodes: 0 2
*** 2011-11-25 10:12:20.213
 Global Resource Directory frozen
 node 0
 node 2
 asby init, 0/0/x2
 asby returns, 0/0/x2/false
* Domain maps before reconfiguration:
*   DOMAIN 0 (valid 1): 0 1 2
* End of domain mappings
* kjbdomrcfg2: domain 0 invalid = TRUE
* Domain maps after recomputation:
*   DOMAIN 0 (valid 0): 0 2
* End of domain mappings
 Dead inst 1
 Join inst
 Exist inst 0 2
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
KSXPLOOKUP: warning cnh incarnation number may be stale 8 8 6
 Active Sendback Threshold = 50 %
 Communication channels reestablished
 sent syncr inc 8 lvl 1 to 0 (8,5/0/0)
 synca inc 8 lvl 1 rcvd (8.5)
 replayed dom info (8.6)
 received master 0 (8.6)
 sent syncr inc 8 lvl 2 to 0 (8,7/0/0)
 synca inc 8 lvl 2 rcvd (8.7)
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Set master node info
 sent syncr inc 8 lvl 3 to 0 (8,13/0/0)
 synca inc 8 lvl 3 rcvd (8.13)
 Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 2204
 sent syncr inc 8 lvl 4 to 0 (8,15/0/0)
 synca inc 8 lvl 4 rcvd (8.15)
 Dwn-cvts replayed, VALBLKs dubious
 sent syncr inc 8 lvl 5 to 0 (8,18/0/0)
 synca inc 8 lvl 5 rcvd (8.18)
 All grantable enqueues granted
 sent syncr inc 8 lvl 6 to 0 (8,20/0/0)
 synca inc 8 lvl 6 rcvd (8.20)
*** 2011-11-25 10:12:22.061
 Submitted all GCS cache requests
 sent syncr inc 8 lvl 7 to 0 (8,22/0/0)
 synca inc 8 lvl 7 rcvd (8.22)
 Fix write in gcs resources
 sent syncr inc 8 lvl 8 to 0 (8,24/0/0)
 synca inc 8 lvl 8 rcvd (8.24)
*** 2011-11-25 10:12:23.459
The crsd and ocssd logs and the lmon/lmd traces on the other two nodes contained nothing of value either — in fact there is no output at all between 09:26 and 10:12. The only somewhat interesting entries are in the lmd trace:
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11 UDP port number 61800
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.11 UDP port number 61800
*** 2011-11-25 09:26:20.449
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10 UDP port number 36992
SKGXPDOCON:connection refused: out of memory
SSKGXPID 0xffffb640 network 0 Internet address 10.10.10.10 UDP port number 33548
*** 2011-11-25 09:27:15.598
Received ORADEBUG command 'dump errorstack 1' from process Unix process pid: 275164, image:
*** 2011-11-25 09:27:15.598
In addition, `oradebug ipc` confirms that the RAC interconnect is indeed running over UDP:
Shared Memory:
ID         KEY
10485768   0xe1892820
Maximum processes: = 750
ksxpdmp: facility 0 (?) (0x1, 0x0) counts 0, 0
ksxpdmp: Dumping the osd context
SKGXPCTX: 0x1105a4c78 ctx
wait delta 13 sec (13149 msec) ctx ts 0xb last ts 0xb
user cpu time since last wait -171798692 sec -1 ticks
system cpu time since last wait 0 sec 0 ticks
locked 1 blocked 0 timed wait receives 0
admno 0x104b9dec admport:
SSKGXPT 0x105a4fc4 flags SSKGXPT_READPENDING active network 0
info for network 0
        socket no 7 IP 10.10.10.12 UDP 55895        <== Note here.
        HACMP network_id 0
sflags SSKGXPT_UP
context timestamp 0xb
no ports
 sconno     accono     ertt state seq#  sent async sync rtrans acks
 0x3115c6c7 0x6e0d721f 32   3     32765 2    2     0    0      2
Nothing of value turned up on the OS side either. The AWR Top 5 during the failure window is shown below:
<div id="B" style="height:auto; width:auto;OVERFLOW-y:auto;border:red 1px solid;margin:10px">
<img src="http://www.killdb.com/wp-content/uploads/2011/11/IPC_send_timeout_error_top_5.jpg" alt="" title="IPC_send_timeout_error_top_5" width="500" height="162" class="aligncenter size-full wp-image-854" />
</div>
The two ges/gcs "enter server mode" wait events here are normal: the instance was performing recovery, and this is the event SMON waits on while doing so. Here is an explanation found via Google:
In a RAC instance, during instance recovery, the first-pass scan can be delayed by 300 ms to 1.5 s waiting on the GRD (Global Resource Directory). During that time SMON enters the wait event called "ges enter server mode".
Another point worth noting is that the SQ-related waits are relatively high. The "Global Cache and Enqueue Services - Messaging Statistics" section likewise shows that cluster interconnect traffic is quite heavy. There is a related note on MOS, although it does not appear to match this situation:
<span style="font-size: 12px; color: #0000ff; font-family: monospace;"> IPC Send timeout detected and Instance eviction on RDBMS and ASM 10.2.0.1 RAC [ID 370182.1] </span>
My conclusion, sent back by email, was as follows. Setting the UDP parameters too small can cause a RAC node to hang, so of the causes listed earlier I consider this one — undersized UDP buffer parameters — the most likely. The current values on this system are:

udp_recvspace = 655360
udp_sendspace = 65536

Per the official documentation, the recommended values are:

udp_recvspace = 2621440 (this one can be left as is)
udp_sendspace = 262144

The officially recommended minimum for udp_sendspace is (DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4 KB. After changing the parameters, monitor the system for a while.

Addendum: the AWR Top 5 also shows fairly high "enq: SQ - contention" waits, so I recommended increasing the sequence cache values as well (the default of 20 is small).
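The sizing rule quoted above is simple arithmetic. A quick sketch (the block size and multiblock read count below are illustrative values, not taken from this system):

```python
def min_udp_sendspace(db_block_size: int, db_file_multiblock_read_count: int) -> int:
    """Minimum udp_sendspace per the quoted rule:
    (DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT) + 4 KB."""
    return db_block_size * db_file_multiblock_read_count + 4 * 1024

# e.g. 8 KB blocks with a multiblock read count of 16:
print(min_udp_sendspace(8192, 16))  # -> 135168
```

Even with these modest example settings, the minimum (135168) is already above the system's old udp_sendspace of 65536, which is consistent with the parameter being too small.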
The sequences in question are under TOYS_SEQ_ZF; they were modified as follows:

alter sequence owner.sequence_name cache 1000;

To check the sequence cache values:

set lines 200
set pagesize 1000
col owner for a20
col SEQUENCE_NAME for a30
select owner, sequence_name, max_value, increment_by, cycle_flag, order_flag, cache_size
  from dba_sequences;
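When many sequences need the same cache change, the ALTER statements can be generated from the query output instead of being typed one by one. A hypothetical helper (the owner/sequence names below are made up; real rows would come from the dba_sequences query above):

```python
def alter_cache_statements(rows, cache=1000):
    """rows: iterable of (owner, sequence_name) pairs, e.g. from dba_sequences."""
    return [f"alter sequence {owner}.{seq} cache {cache};" for owner, seq in rows]

# Made-up example rows:
for stmt in alter_cache_statements([("TOYS", "SEQ_ORDER_ID"), ("TOYS", "SEQ_ITEM_ID")]):
    print(stmt)
```

The generated script can then be reviewed and run in SQL*Plus.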
4 Responses to ""IPC send timeout error" Causes RAC Node Hang"
Actually, on AIX there is one more possibility: a missing OS fix can itself cause IPC send timeouts — see AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]
Postscript:
After the first round of UDP parameter changes, the problem recurred the next day, so I traveled to the customer site in Xi'an myself. The system has been stable since the following adjustments:
1. Adjust db_cache_size and shared_pool_size;
2. Increase the cache value of some sequences;
3. Raise the root user's resource limits;
4. Disable DRM;
5. Restart the database instances and CRS.