Case Analysis: A RAC Instance Mysteriously Restarted
On a customer's RAC, node 1 was restarted for no obvious reason, as shown below:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
License high water mark = 5209
Wed Aug 14 08:26:03 2013
Termination issued to instance processes. Waiting for the processes to exit
Wed Aug 14 08:26:09 2013
Instance termination failed to kill one or more processes
Instance terminated by LMON, pid = 15573208
Wed Aug 14 08:26:10 2013
Instance terminated by USER, pid = 15335602
Wed Aug 14 08:27:17 2013
Starting ORACLE instance (normal)
At the same time, the other node of this RAC went through a reconfiguration. Its alert log shows:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes: 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
 LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013
This part of the log does not tell us much by itself, so let's look more carefully at node 1's alert log:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected. Receiver ospid 14856422     ++++ from the ospid we can see that 14856422 is exactly lms3
Receiver is waiting for a latch
 dumping latch state for receiver -16444
Wed Aug 14 08:23:50 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms3_14856422.trc:
Wed Aug 14 08:23:54 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:51 2013
Errors in file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc:
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lmon_15573208.trc:
ORA-29740: evicted by member 1, group incarnation 6
Wed Aug 14 08:25:54 2013
LMON: terminating instance due to error 29740          +++++ the lmon process terminates the instance after raising ORA-29740
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_pmon_6075054.trc:
ORA-29740: evicted by member , group incarnation
……..
……
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms9_14266524.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Errors in file /oraclelog/ngcrmdb/bdump/ngcrmdb1_lms8_20657014.trc:
ORA-29740: evicted by member , group incarnation
Wed Aug 14 08:25:54 2013
Doing block recovery for file 204 block 35113
Wed Aug 14 08:26:00 2013
Shutting down instance (abort)
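(As a side note: while the instance is still up, the mapping from an OS pid such as 14856422 to its background process can be confirmed straight from the data dictionary rather than by eyeballing trace file names. A minimal sketch using the standard v$ views — the ospid literal is simply the value from the log above:)

-- Which background process owns a given OS pid? Run on the affected instance.
SELECT p.spid    AS os_pid,
       p.pid     AS oracle_pid,
       b.name    AS bg_process,     -- e.g. LMS3, LMON, LGWR; NULL for a foreground process
       p.program
FROM   v$process p
       LEFT JOIN v$bgprocess b ON b.paddr = p.addr
WHERE  p.spid = '14856422';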
At the same time, node 2's alert log also reports the timeout of node 1's lms3 process:
Wed Aug 14 08:23:20 2013
IPC Send timeout detected.Sender: ospid 454822        +++++ the lms3 process on the crm2 node
Receiver: inst 1 binc 57382989 ospid 14856422         +++++ the lms3 process on the crm1 node
Wed Aug 14 08:23:29 2013
IPC Send timeout to 0.4 inc 4 for msg type 65516 from opid 10
Wed Aug 14 08:23:29 2013
Communications reconfiguration: instance_number 1
Wed Aug 14 08:23:30 2013
Trace dumping is performing id=[cdmp_20130814082329]
Wed Aug 14 08:25:19 2013
Evicting instance 1 from cluster
Wed Aug 14 08:26:09 2013
Reconfiguration started (old inc 4, new inc 8)
List of nodes: 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Wed Aug 14 08:26:10 2013
 LMS 5: 23 GCS shadows cancelled, 1 closed
Wed Aug 14 08:26:10 2013
From node 2's trace file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc we can confirm that process 454822 is the lms3 process on the crm2 node, as shown below:
Dump file /oraclelog/ngcrmdb/bdump/ngcrmdb2_lms3_454822.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:    AIX
Node name:      ngcrmdb2
Release:        3
Version:        5
Machine:        00C8A6E24C00
Instance name: ngcrmdb2
Redo thread mounted by this instance: 0 <none>
Oracle process number: 10
Unix process pid: 454822, image: oracle@ngcrmdb2 (LMS3)
......
......
*** 2013-08-12 10:34:58.266
kjbmskipref(0x312f.3b10000)(seqs 0x13361338)(rids 0x13f813f9)(from 0x192)
*** 2013-08-14 08:23:20.471
*** 2013-08-14 08:23:20.474
IPC Send timeout detected.Sender: ospid 454822
Receiver: inst 1 binc 57382989 ospid 14856422
SKGXPCTX: 0x1104103b0 ctx
WAIT HISTORY
            Time(msec)   Wait Type     Return Code
            ----------   ---------     ------------
                    14      NORMAL          POSTED
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
                    30      NORMAL        TIMEDOUT
 wait delta 0 sec (3 msec) ctx ts 0x57cd2bd2 last ts 0x57cd2bd2
 user cpu time since last wait 0 sec 0 ticks
 system cpu time since last wait 0 sec 0 ticks
 locked 1 blocked 8 timed wait receives 87 admno 0xbc1dc98
 admport:
SSKGXPT 0x104111c0 flags socket no 11 IP 192.168.5.2 UDP 33191
 context timestamp 0x57cd2bd2
  buffers queued on port 1104cff40
   sconno     accono     ertt  state  seq#   sent       async      sync  rtrans  acks
   0x7dd68acd 0x12d3831a 1024  3      42918  292492860  292492982  0     2695    286577042
 slot 0 rqh=1110ca6c8 seq=42912 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
 slot 1 rqh=1110ca2e0 seq=42913 len=176 accno=0x12d3831a start TS=0x57cd289a rt TS=0x57cd2c9a X CNT=1
 slot 2 rqh=1110c9ef8
……
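(Incidentally, the trace above also shows which interconnect address the LMS traffic was using — IP 192.168.5.2 over UDP. When chasing IPC send timeouts it is worth cross-checking that this matches what each instance actually registered; a small illustrative query against the standard gv$ view, nothing case-specific beyond the idea:)

-- Interconnect(s) registered by each RAC instance, and where the setting came from
SELECT inst_id, name, ip_address, is_public, source
FROM   gv$cluster_interconnects
ORDER  BY inst_id;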
From this we can see that a timeout in the lms3-to-lms3 communication between the two instances is what caused the crm1 node to be evicted and restarted. Based on the crm1 node's alert log, the sequence of events is as follows:
Aug 14 08:23:20 – lms3 timeout (detected by the lmon check)
Aug 14 08:25:54 – ORA-29740: evicted by member 1, group incarnation 6 is raised; the lmon process terminates the instance
Aug 14 08:25:54 – block recovery completes (Doing block recovery for file 204 block 35113), performed by SMON
Aug 14 08:26:00 – the crm1 instance is shut down with abort
From node 1's alert log we can see that the lms3 process timed out at Aug 14 08:23:20. Generally speaking, an lms timeout can be caused by excessive system pressure, i.e. very high load. Below we analyze the traces of the lgwr, lmd and lmon processes to test this. First, the lgwr trace:
*** 2013-08-13 02:29:49.603
Warning: log write time 1500ms, size 278KB
*** 2013-08-13 03:24:00.496
Warning: log write time 1440ms, size 13397KB
*** 2013-08-13 03:24:04.213
Warning: log write time 540ms, size 18775KB
*** 2013-08-13 03:24:05.298
Warning: log write time 1080ms, size 10027KB
*** 2013-08-13 03:24:09.590
Warning: log write time 540ms, size 10124KB
*** 2013-08-13 03:24:10.114
Warning: log write time 520ms, size 18577KB
*** 2013-08-13 03:24:14.393
Warning: log write time 540ms, size 16529KB
*** 2013-08-13 03:24:15.400
Warning: log write time 1010ms, size 13973KB
*** 2013-08-13 03:24:16.813
Warning: log write time 500ms, size 20055KB
*** 2013-08-13 03:24:19.572
Warning: log write time 1360ms, size 11320KB
*** 2013-08-13 03:24:22.962
Warning: log write time 760ms, size 18383KB
*** 2013-08-13 03:24:25.383
Warning: log write time 510ms, size 13886KB
*** 2013-08-13 03:24:26.791
Warning: log write time 1410ms, size 14385KB
*** 2013-08-13 03:24:41.486
Warning: log write time 890ms, size 11268KB
*** 2013-08-13 03:24:42.371
Warning: log write time 880ms, size 27178KB
*** 2013-08-13 03:24:55.580
Warning: log write time 570ms, size 14009KB
*** 2013-08-13 03:24:58.326
Warning: log write time 2730ms, size 31518KB
*** 2013-08-13 09:53:36.967
Warning: log write time 1740ms, size 5KB
*** 2013-08-13 12:10:41.138
Warning: log write time 580ms, size 2KB
*** 2013-08-13 14:39:31.894
Warning: log write time 1080ms, size 102KB
*** 2013-08-14 00:51:42.142
Warning: log write time 800ms, size 347KB
*** 2013-08-14 01:20:31.054
Warning: log write time 510ms, size 1455KB
*** 2013-08-14 01:24:35.160
Warning: log write time 500ms, size 463KB
*** 2013-08-14 01:24:36.246
Warning: log write time 1080ms, size 887KB
*** 2013-08-14 01:24:37.135
Warning: log write time 890ms, size 2201KB
*** 2013-08-14 02:04:42.532
Warning: log write time 590ms, size 14KB
*** 2013-08-14 02:06:04.027
Warning: log write time 570ms, size 43KB
*** 2013-08-14 03:25:27.431
Warning: log write time 500ms, size 13805KB
*** 2013-08-14 03:25:42.751
Warning: log write time 650ms, size 1919KB
*** 2013-08-14 03:25:52.230
Warning: log write time 1130ms, size 35575KB
*** 2013-08-14 03:25:52.737
Warning: log write time 500ms, size 9288KB
*** 2013-08-14 03:25:54.660
Warning: log write time 1020ms, size 9277KB
*** 2013-08-14 03:25:55.581
Warning: log write time 910ms, size 36265KB
*** 2013-08-14 03:25:56.369
Warning: log write time 500ms, size 36267KB
*** 2013-08-14 03:25:58.585
Warning: log write time 1010ms, size 14202KB
tkcrrsarc: (WARN) Failed to find ARCH for message (message:0x1)
tkcrrpa: (WARN) Failed initial attempt to send ARCH message (message:0x1)
*** 2013-08-14 03:59:30.005
Warning: log write time 540ms, size 89KB
*** 2013-08-14 08:25:54.390
error 29740 detected in background process
ORA-29740: evicted by member , group incarnation
Only a small excerpt of the lgwr trace is shown here, but it is enough to see log write timeouts in the 03:25–03:59 window in the early morning. Comparing across the days, the lgwr write warnings are in fact at their worst around 03:25 every day, which means that is when the system's I/O pressure peaks. Between 03:59 and 08:25, however, nothing else was logged, so the I/O pressure in that window should not have been high — although that by itself does not prove that the overall system load was low. (A quick way to cross-check redo-write latency from inside the database is sketched below.)
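(As a rough cross-check of redo-write latency — purely illustrative, using the standard event histogram view rather than anything taken from this case's traces — the distribution of LGWR's physical write times can be pulled like this:)

-- Cumulative distribution of LGWR redo writes by latency bucket since instance startup
SELECT event,
       wait_time_milli,   -- upper bound of the latency bucket, in milliseconds
       wait_count
FROM   v$event_histogram
WHERE  event = 'log file parallel write'
ORDER  BY wait_time_milli;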
Next we continue with the lmon process trace. The contents of ngcrmdb1_lmon_6627610.trc are as follows:
Unix process pid: 6627610, image: oracle@ngcrmdb1 (LMON)
*** SERVICE NAME:() 2013-08-14 08:27:21.684
*** SESSION ID:(13202.5) 2013-08-14 08:27:21.684
GES resources 242662 pool 56
GES enqueues 378762
GES IPC: Receivers 15 Senders 15
GES IPC: Buffers Receive 1000 Send (i:24750 b:1913) Reserve 1000
GES IPC: Msg Size Regular 408 Batch 8192
Batching factor: enqueue replay 201, ack 224
Batching factor: cache replay 126 size per lock 64
kjxggin: receive buffer size = 32768
high load threshold = 200704                     ++++ the load is high, already over the threshold
*** 2013-08-14 08:27:26.865
kjxgmrcfg: Reconfiguration started, reason 1     +++++ the instance performs a reconfiguration, meaning the crm1 node was restarted at this point
kjxgmcs: Setting state to 0 0.                   +++++ state 0 means the node does not belong to the cluster
*** 2013-08-14 08:27:26.867
 Name Service frozen
kjxgmcs: Setting state to 0 1.                   +++++ the node is rejoined to the cluster and the state changes to 1 (meaning it is being monitored)
kjxgrssvote: reconfig bitmap chksum 0xfa98ff9 cnt 2 master 0 ret 0
kjfcpiora: published my fusion master weight 7699028
kjfcpiora: published my enqueue weight 242662
kjfcpiora: publish my flogb 16
kjfcpiora: publish my cluster_database_instances parameter = 2
kjxggpoll: change poll time to 50 ms
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 10 2.
kjfmuin: bitmap 0 1
kjfmmhi: received msg from 0 (inc 10)
kjfmmhi: received msg from 1 (inc 4)
 Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 10 3.
 Name Service recovery started
 Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 10 4.
 Multicasted all local name entries for publish
 Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 10 5.
 Name Service normal
 Name Service recovery done
*** 2013-08-14 08:27:27.464
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 10 6.
*** 2013-08-14 08:27:27.565
*** 2013-08-14 08:27:27.565
Reconfiguration started (old inc 0, new inc 10)
Dynamic remastering is disabled
Synchronization timeout interval: 900 sec
List of nodes: 0 1
*** 2013-08-14 08:27:27.565
kjxggpoll: change poll time to 600 ms
 Global Resource Directory frozen
 node 0
 release 10 2 0 4
 node 1
 release 10 2 0 4
 number of mastership buckets = 128
。。。。。。
 Post SMON to start 1st pass IR
 Fix write in gcs resources
 sent syncr inc 10 lvl 8 to 0 (10,24/0/0)
 sent synca inc 10 lvl 8 (10,24/0/0)
*** 2013-08-14 08:27:36.170                      +++++ reconfig complete
Reconfiguration complete
* domain 0 valid?: 1
*** 2013-08-14 08:27:41.581
kjxgrtmc2: mounting member 0 thread 1            +++++ mount thread (crm1)
From this trace, the reconfiguration started at 2013-08-14 08:27:26 and was complete by 2013-08-14 08:27:41, when the thread was mounted — roughly 15 seconds in all. Judging from the information above, the lms3 process itself was under heavy pressure; note that the "load" here refers to that single process, not to the system as a whole. In RAC, the lmon process is mainly responsible for monitoring the cluster's global enqueues and resources and for performing global enqueue recovery. After an instance terminates abnormally, lmon handles the GCS memory cleanup, and whenever an instance joins or leaves the cluster, lmon reconfigures the locks and resources.

In addition, lmon performs communication checks between the different instances in the cluster; if it finds that the peer's communication has timed out, it issues a node eviction and drives that node out.

From all of the above, it was indeed the lms3 inter-instance communication timing out that made lmon issue the eviction and restart the instance on the crm1 node.
Another trace, ngcrmdb1_lmon_15573208.trc, records more detail about the lms processes:
*** 2013-07-28 07:12:17.257
Reconfiguration complete
* domain 0 valid?: 1
*** 2013-08-14 08:20:30.089
kjfmrcvrchk: receiver LMS[3] check timed out (1376439430.1376439630.0).   +++ lms3 is found to have timed out
kjfmrcvrchk: Dumping callstack of lms3
Submitting asynchronized dump request [20]
*** 2013-08-14 08:23:29.891
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3                  +++ reconfig protection kicks in, trying to avoid a restart
kjxgrrcfgchk: Initiating reconfig, reason 3
kjxgrrcfgchk: Increased trace level for comm reconfig from 0 to 10
kjxgrrcfgchk: prev pstate 5 mapsz 512
kjxgrrcfgchk: new bmp: 0 1
kjxgrrcfgchk: work bmp: 0 1
kjxgrrcfgchk: rr bmp: 0 1
kjxgrrcfgchk: the instance is NOT in good health.
*** 2013-08-14 08:23:30.496
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3                  +++ reason 3 means node communication within the cluster failed
*** 2013-08-14 08:23:31.100
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
.......
.......
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:37.161
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:23:37.162
kjxgrmsghndlr: Recvd msg 1104e60a0 type 5 from 1, seq (4,5)
*** 2013-08-14 08:23:37.162
Received heartbeat ack from 1
*** 2013-08-14 08:23:37.767
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:38.391
.......
.......
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:42.040
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:23:42.041
kjxgrmsghndlr: Recvd msg 1104e6640 type 5 from 1, seq (4,5)
*** 2013-08-14 08:23:42.041
Received heartbeat ack from 1
*** 2013-08-14 08:23:42.646
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:43.291
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:23:43.897
.......
.......
*** 2013-08-14 08:25:14.505
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:25:14.506
kjxgrmsghndlr: Recvd msg 1104eda80 type 5 from 1, seq (4,5)
*** 2013-08-14 08:25:14.506
Received heartbeat ack from 1
*** 2013-08-14 08:25:15.110
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
.......
.......
*** 2013-08-14 08:25:19.400
kjxgrgetresults: Detect reconfig from 1, seq 5, reason 3
*** 2013-08-14 08:25:19.400
kjxgrnbrhbmsg: Sending HB chk req to 1, HB 822424272
*** 2013-08-14 08:25:19.402
kjxgrmsghndlr: Recvd msg 1104ee020 type 5 from 1, seq (4,5)
*** 2013-08-14 08:25:19.402
Received heartbeat ack from 1
*** 2013-08-14 08:25:20.005
kjxgrdtrt: Evicted by 1, seq (6, 4)
IMR state information
  Member 0, thread 1, state 0x4:4, flags 0x0040
  RR seq commit 4 cur 4
  Propstate 5 prv 4 pending 0
  system load 3402 (normal)
Member information:
  Member 0, incarn 2, version 0x36b98a7, thrd 1
   prev thrd 65535, status 0x0007 (JRM.), err 0x0000
  Member 1, incarn 4, version 0x370baab, thrd 2
   prev thrd 65535, status 0x0007 (JRM.), err 0x0000
From this trace we can put together the following timeline:
2013-08-14 08:20:30 ---- the check finds that lms3 has timed out
2013-08-14 08:23:29 ---- the trace level is raised and reconfig detection kicks in, trying to avoid a restart
2013-08-14 08:25:20 ---- the eviction request is issued
We already know that the instance was restarted because the lms3 processes of the two instances timed out talking to each other. But was the load on the crm system really that high at that point in time? Let's check against the data, starting with the diag information dumped for the lms3 process:
*** 2013-08-14 08:25:27.917
Dumping diagnostic information for ospid 14856422:
OS pid = 14856422
loadavg : 2.49 3.72 5.99
swap info: free_mem = 7493.79M rsv = 256.00M
           alloc = 2105.66M avail = 65536.00M swap_free = 63430.34M
      F S      UID      PID  PPID   C PRI NI ADDR       SZ      WCHAN  STIME  TTY  TIME     CMD
 240103 A   oracle 14856422     1   4  39 -- 1f059f6510 110936         Jul 28   -  1414:58  ora_lms3_ngcrmdb1
open: The file access permissions do not allow the specified action.
We can see that the process was last started on July 28 and that its PRI value is 39; the load averages reported at that moment are 2.49, 3.72 and 5.99.
From this, the operating system load does not look high. Grepping the same trace shows that the loadavg values recorded around the other lms process dumps are all much the same:
[root@killdb ~]# cat ngcrmdb1_lmon_15573208.trc |grep 'loadavg :'
loadavg : 2.21 3.71 6.01
loadavg : 2.35 3.72 6.00
loadavg : 2.35 3.72 6.00
loadavg : 2.49 3.72 5.99
loadavg : 2.49 3.72 5.99
loadavg : 2.49 3.72 5.99
loadavg : 2.85 3.78 6.00
loadavg : 2.85 3.78 6.00
loadavg : 2.86 3.76 5.98
loadavg : 2.86 3.76 5.98
loadavg : 2.86 3.76 5.98
loadavg : 2.79 3.74 5.96
loadavg : 2.79 3.74 5.96
loadavg : 2.81 3.72 5.94
loadavg : 2.81 3.72 5.94
loadavg : 2.81 3.72 5.94
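(For completeness: when the instance is reachable, the same OS load figures can also be sampled from inside the database. A minimal sketch against the standard v$osstat view — not something taken from this case's traces, and the 'LOAD' statistic is only populated on platforms that expose it:)

-- Current OS load average and CPU count as seen by the instance
SELECT stat_name, value
FROM   v$osstat
WHERE  stat_name IN ('LOAD', 'NUM_CPUS');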
At this point we can see that the load around the lms3 process was not actually high, so why did it wait long enough to time out — to the point that the lms3 communication between the RAC instances failed and the instance was evicted and restarted? What exactly was lms3 waiting for? Let's look at the node 1 lms3 process information:
We can see that lms3 had been waiting for the cache buffers chains latch, and the wait was long enough to time out. Normally a cbc latch is held only very briefly, typically at the millisecond level, yet here the wait ran into hundreds of seconds. The resource lms3 was requesting, 70000150072db50, was held by orapid 1386, so let's see what process orapid 1386 is:

    SO: 70000154841a800, type: 2, owner: 0, flag: INIT/-/-/0x00
    (process) Oracle pid=1386, calls cur/top: 70000146b345730/700001469390a10, flag: (0) -
              int error: 0, call error: 0, sess error: 0, txn error 0
    (post info) last post received: 109 0 4
                last post received-location: kslpsr
                last process to post me: 7000015453a5098 1 6
                last post sent: 0 0 24
                last post sent-location: ksasnd
                last process posted by me: 7000015453a5098 1 6
    (latch info) wait_event=0 bits=2
        holding    (efd=16) 70000150072db50 Child cache buffers chains level=1 child#=476313
        Location from where latch is held: kcbgtcr: kslbegin excl:
        Context saved from call: 557847035
        state=busy(exclusive) (val=0x200000000000056a) holder orapid = 1386
        waiters [orapid (seconds since: put on list, posted, alive check)]:
         1594 (1051, 1376439951, 2)
         53 (1033, 1376439951, 2)
         98 (1032, 1376439951, 2)
         1502 (1011, 1376439951, 2)
         69 (1009, 1376439951, 2)
         124 (960, 1376439951, 2)
         213 (950, 1376439951, 2)
         191 (948, 1376439951, 2)
         75 (712, 1376439951, 2)
         1359 (695, 1376439951, 2)
         424 (650, 1376439951, 2)
         1572 (636, 1376439951, 2)
         89 (591, 1376439951, 2)
         5168 (588, 1376439951, 2)
         33 (521, 1376439951, 2)          ++++ 33 is the lms3 process
         1296 (394, 1376439951, 2)
         4049 (334, 1376439951, 2)
         163 (333, 1376439951, 2)
         4437 (91, 1376439951, 2)
         4459 (32, 1376439951, 2)
         waiter count=20
    Process Group: DEFAULT, pseudo proc: 7000015446c94e8
    O/S info: user: oracle, term: UNKNOWN, ospid: 7504228
    OSD pid info: Unix process pid: 7504228, image: oracle@ngcrmdb1
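(On a live system the same picture — which session currently holds a cache buffers chains child latch and who is sleeping on it — can be approximated from the data dictionary instead of a process/systemstate dump. A rough sketch with the standard views; in 10g the waiters show up under the 'latch: cache buffers chains' event:)

-- Sessions currently holding a cache buffers chains child latch, with their OS pid and SQL
SELECT lh.laddr      AS child_latch_addr,
       lh.name       AS latch_name,
       s.sid,
       p.spid        AS os_pid,
       s.sql_hash_value
FROM   v$latchholder lh
       JOIN v$session s ON s.sid  = lh.sid
       JOIN v$process p ON p.addr = s.paddr
WHERE  lh.name = 'cache buffers chains';

-- Sessions sleeping on a cbc latch; p1raw is the address of the child latch they want
SELECT sid, event, p1raw AS child_latch_addr, seconds_in_wait
FROM   v$session_wait
WHERE  event = 'latch: cache buffers chains';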
It is clear that the lms3 process was indeed waiting on the pid 1386 process: lms3 wanted a cbc latch, and that latch was held by pid 1386 and never released. Nor was lms3 alone — another 19 processes were queued behind the same latch. The wait history of the pid 1386 process looks like this:

    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=1051
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=340
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=149098
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=269
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=318
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=378
                buffer#=6e6d, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1031
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0
    ……….
From the information above, the orapid 1386 session logged in from 10.200.141.139, host name ngibweb3. Its wait event is 'SQL*Net message from client', and yet it had been holding the cbc latch for a long time without releasing it. For a process to hold a cbc latch for that long, there are essentially a few possibilities:

1. The process is hung;
2. An Oracle bug, or the operating system itself is hung.

The lms3 process had been waiting for more than 5 minutes while the cbc resource stayed in this process's hands, so it looks like the session's process may well have hung. How, then, can we confirm that the process was probably hung?
*** 2013-08-14 08:06:53.459
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.49 9.52 8.51
swap info: free_mem = 9215.18M rsv = 256.00M
           alloc = 2086.34M avail = 65536.00M swap_free = 63449.66M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN            STIME TTY TIME    CMD
 240001 A   oracle 7504228     1  32  76 20 1af21a7510 96324  f10006000ae020c8 Jul 28  -  568:26  oraclengcrmdb1 (LOCAL=NO)
...........
...........
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368ee48, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 3, prv: 0, sql: 70000145c8bbba0, psql: 70000145c969a38, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=41178 wait_time=325 seconds since wait started=0
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=325
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=2148096
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=4
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=345
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=386
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=482
                buffer#=693f, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1267
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0

*** 2013-08-14 08:08:13.068
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.51 9.48 8.57
swap info: free_mem = 9158.54M rsv = 256.00M
           alloc = 2086.33M avail = 65536.00M swap_free = 63449.67M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME TTY TIME    CMD
 240001 A   oracle 7504228     1  33  76 20 1af21a7510 98492         Jul 28  -  568:29  oraclengcrmdb1 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/7504228/ctl): The requested resource is busy.
7504228: oraclengcrmdb1 (LOCAL=NO)
0x0000000000000000  ????????() + ??
........
........
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368ee48, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 3, prv: 0, sql: 70000145ac1ef50, psql: 70000145db36f00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'gc current request' blocking sess=0x0 seq=42098 wait_time=432 seconds since wait started=0
                file#=23f, block#=fe61a, id#=2000001
    Dumping Session Wait History
     for 'gc current block 2-way' count=1 wait_time=432
                =23f, =fe61a, =2000001
     for 'gc current block 2-way' count=1 wait_time=305
                =1be, =64f8, =2000001
     for 'gc cr block 2-way' count=1 wait_time=423
                =1be, =64f8, =1
     for 'gc current block 2-way' count=1 wait_time=482
                =2b3, =24e03, =2000001
     for 'gc current block 2-way' count=1 wait_time=327
                =1be, =64b7, =2000001
     for 'gc cr block 2-way' count=1 wait_time=413
                =1be, =64b7, =1
     for 'gc current block 2-way' count=1 wait_time=366
                =23f, =fe681, =2000001
     for 'gc current block 2-way' count=1 wait_time=418
                =1be, =64a9, =2000001
     for 'gc cr block 2-way' count=1 wait_time=453
                =1be, =64a9, =1
     for 'gc current block 2-way' count=1 wait_time=454
                =1b4, =fa1c2, =2000001
    temporary object counter: 0

*** 2013-08-14 08:08:17.929
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 9.63 9.50 8.59
swap info: free_mem = 9263.69M rsv = 256.00M
           alloc = 2086.33M avail = 65536.00M swap_free = 63449.67M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME TTY TIME    CMD
 240001 A   oracle 7504228     1  41  80 20 1af21a7510 96448         Jul 28  -  568:31  oraclengcrmdb1 (LOCAL=NO)
open: The file access permissions do not allow the specified action.
procstack: write(/proc/7504228/ctl): The requested resource is busy.
7504228: oraclengcrmdb1 (LOCAL=NO)
0x0000000000000000  ????????() + ??
*** 2013-08-14 08:08:18.810
...........
...........
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=0
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=340
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=149098
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=269
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=318
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=378
                buffer#=6e6d, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1031
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0

*** 2013-08-14 08:25:49.363
Dumping diagnostic information for ospid 7504228:
OS pid = 7504228
loadavg : 2.81 3.72 5.94
swap info: free_mem = 7515.38M rsv = 256.00M
           alloc = 2106.36M avail = 65536.00M swap_free = 63429.64M
      F S      UID     PID  PPID   C PRI NI ADDR       SZ     WCHAN  STIME TTY TIME    CMD
 240001 T   oracle 7504228     1   0  60 20 1af21a7510 96324         Jul 28  -  568:31  oraclengcrmdb1 (LOCAL=NO)
.......
.......
    SO: 700001542fc0208, type: 4, owner: 70000154841a800, flag: INIT/-/-/0x00
    (session) sid: 11786 trans: 7000014f368a0a8, creator: 70000154841a800, flag: (100045) USR/- BSY/-/-/-/-/-
              DID: 0001-056A-00000008, short-term DID: 0001-056A-00000009
              txn branch: 0
              oct: 7, prv: 0, sql: 70000145dad36e8, psql: 700001459f4cc00, user: 41/UOP_CRM1
    O/S info: user: webapp, term: unknown, ospid: 1234, machine: ngibweb3
              program: JDBC Thin Client
    client info: logon 10.200.141.139
    application name: JDBC Thin Client, hash value=2546894660
    last wait for 'SQL*Net message from client' blocking sess=0x0 seq=43475 wait_time=340 seconds since wait started=1051
                driver id=28444553, #bytes=1, =0
    Dumping Session Wait History
     for 'SQL*Net message from client' count=1 wait_time=340
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=2
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=149098
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=3
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=269
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message from client' count=1 wait_time=318
                driver id=28444553, #bytes=1, =0
     for 'SQL*Net message to client' count=1 wait_time=1
                driver id=28444553, #bytes=1, =0
     for 'log file sync' count=1 wait_time=378
                buffer#=6e6d, =0, =0
     for 'SQL*Net message from client' count=1 wait_time=1031
                driver id=28444553, #bytes=1, =0
    temporary object counter: 0
We can see that at the last two timestamps, 2013-08-14 08:08:18 and 2013-08-14 08:25:49, the TIME value shown for this process in the ps output did not change at all, the session stayed busy, and in both dumps the SQL being executed was 70000145dad36e8. So what statement is 70000145dad36e8?
Cursor#6(1104a1958) state=BOUND curiob=1105400e8
 curflg=cd fl2=0 par=0 ses=700001542fc0208
 sqltxt(70000145dad36e8)=DELETE /*+ index(A IDX_TI_B_MCAS_UDR_3) */ FROM TI_B_MCAS_UDR A
   WHERE DECODE(A.SERV_TYPE,'VGOP',A.SERV_TYPE||A.BIZ_TYPE_CODE,A.SERV_TYPE) NOT IN ('DSMP',
   '1258','CMMB','VGOP60','VGOP10','NIAP','RINP','STMM','PIMM','MGIS','PPOC','MCSP','MDOP','MBLG','SURF','NFCP','MCLD','MEBP')
 hash=048fa56d3419181a96e347d96053b942
 parent=70000145dab3218 maxchild=03 plk=70000146b2e2208 ppn=n
cursor instantiation=1105400e8 used=1376438898
 child#2(7000011f63f85e0) pcs=7000012b5494ea8
  clk=7000013ac106198 ci=7000011ddc3d0b0
  pn=70000146bb6b560 ctx=70000140e841ab8
 kgsccflg=0 llk[1105400f0,1105400f0] idx=0
 xscflg=c0100276 fl2=c000400 fl3=228208c fl4=100
Frames pfr 110540080 siz=11728 efr 11053ff90 siz=11672
 Cursor frame dump
  enxt: 4.0x00000480  enxt: 3.0x00000af8  enxt: 2.0x00000ec8  enxt: 1.0x00000f58
  pnxt: 2.0x00000008  pnxt: 1.0x00000030
 kxscphp  1105384e8  siz=984  inu=344  nps=344
We can see that the process had been running a delete from 8:08 all the way to 8:25, about 17 minutes, and from the trace information the process had indeed been stuck for 1051 seconds.
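(As an aside, on a running system the question "which statement is this session executing right now?" can also be answered straight from the dictionary, without dumping the process. A minimal sketch — sid 11786 is taken from the dump above, everything else is standard:)

-- Current statement of session 11786, resolved through the cursor address and hash value
SELECT s.sid, s.sql_address, s.sql_hash_value,
       q.sql_id, q.plan_hash_value, q.sql_text
FROM   v$session s
       JOIN v$sql q ON  q.address    = s.sql_address
                    AND q.hash_value = s.sql_hash_value
WHERE  s.sid = 11786;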
Finally, let's confirm from the process's own trace what this session was doing. The information is as follows:
Dump file /oraclelog/ngcrmdb/udump/ngcrmdb1_ora_7504228.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining Scoring Engine
and Real Application Testing options
ORACLE_HOME = /oracle/app/product/10.2.0/db_1
System name:    AIX
Node name:      ngcrmdb1
Release:        3
Version:        5
Machine:        00C8A6F24C00
Instance name: ngcrmdb1
Redo thread mounted by this instance: 1
Oracle process number: 1386
Unix process pid: 7504228, image: oracle@ngcrmdb1

*** 2013-08-14 08:25:51.146
*** ACTION NAME:() 2013-08-14 08:25:51.119
*** MODULE NAME:(JDBC Thin Client) 2013-08-14 08:25:51.119
*** SERVICE NAME:(ngcrmdb) 2013-08-14 08:25:51.119
*** SESSION ID:(11786.20) 2013-08-14 08:25:51.119
Received ORADEBUG command 'dump errorstack 3' from process Unix process pid: 5886552, image:
*** 2013-08-14 08:25:51.146
ksedmp: internal or fatal error
Current SQL statement for this session:
DELETE /*+ index(A IDX_TI_B_MCAS_UDR_3) */ FROM TI_B_MCAS_UDR A WHERE DECODE(A.SERV_TYPE,'VGOP',A.SERV_TYPE||A.BIZ_TYPE_CODE,A.SERV_TYPE) NOT IN ('DSMP',
'1258','CMMB','VGOP60','VGOP10','NIAP','RINP','STMM','PIMM','MGIS','PPOC','MCSP','MDOP','MBLG','SURF','NFCP','MCLD','MEBP')
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
700001459e79290        66  procedure UOP_CRM1.P_UIP_MERGER_MCASDATA
70000145c91ebc8         1  anonymous block
The process trace agrees with our earlier analysis: the session was executing a delete. The execution plan of that SQL is shown below:
============
Plan Table
============
-----------------------------------------------------------+-----------------------------------+
| Id | Operation                     | Name                | Rows  | Bytes | Cost | Time       |
-----------------------------------------------------------+-----------------------------------+
| 0  | DELETE STATEMENT              |                     |       |       |  39K |            |
| 1  |  DELETE                       | TI_B_MCAS_UDR       |       |       |      |            |
| 2  |   TABLE ACCESS BY INDEX ROWID | TI_B_MCAS_UDR       |     1 |    47 |  39K |  00:08:05  |
| 3  |    INDEX FULL SCAN            | IDX_TI_B_MCAS_UDR_3 |  169K |       | 1505 |  00:00:19  |
-----------------------------------------------------------+-----------------------------------+
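(For reference, on a running 10g system the same plan can be pulled with DBMS_XPLAN once the sql_id is known, for example from the query shown earlier; the sql_id below is only a placeholder, not a value taken from this case:)

-- Show the runtime plan of a cursor that is still present in the shared pool
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('placeholder_sql_id', 0, 'TYPICAL'));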
We can see that the execution of this SQL mainly involves an INDEX FULL SCAN. Keep in mind that an index full scan is performed with single-block reads, and because this is a RAC environment Oracle has to ship blocks between the instances to guarantee consistency as part of cache fusion — work that is largely carried out by the lms processes. Yet this process kept holding the cbc latch without releasing it, most likely because the session itself had already hung.

In the end, node 1's lms3 process could not obtain the cbc latch for so long that it timed out; the lmon check detected this, and lmon finally restarted the instance.