最近一周遇到的几个case

本站文章除注明转载外，均为本站原创： 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客

最近一周遇到了好几个case，简单的总结一下，跟大家分享.

CASE 1.

某客户的一套9208的库，在跑批处理的时候，异常crash了，其中alert 信息如下：

Thread 1 advanced to log sequence 51938
Current log# 2 seq# 51938 mem# 0: /oradata/cifs/redologfiles/redo21.log
Current log# 2 seq# 51938 mem# 1: /oradata/cifs/redologfiles/redo22.log
Sun Jun 26 11:35:16 2011
Hex dump of Absolute File 4, Block 53784 in trace file /oracle/app/oracle/admin/cifs/
bdump/cifs_dbw0_225460.trc
***
Corrupt block relative dba: 0x0100d218 (file 4, block 53784)
Bad header found during preparing block for write
Data in bad block -
type: 6 format: 2 rdba: 0x04c05738
last change scn: 0x0001.42162bd7 seq: 0x2 flg: 0x04
consistency value in tail: 0x2bd70602
check value in block header: 0x3d2d, computed block checksum: 0x0
spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Sun Jun 26 11:35:16 2011
Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []
Sun Jun 26 11:35:17 2011
Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:
ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []
DBW0: terminating instance due to error 600
Instance terminated by DBW0, pid = 225460
Sun Jun 26 13:06:07 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
SCN scheme 3
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.8.0.

Thread 1 advanced to log sequence 51938

Current log# 2 seq# 51938 mem# 0: /oradata/cifs/redologfiles/redo21.log

Current log# 2 seq# 51938 mem# 1: /oradata/cifs/redologfiles/redo22.log

Sun Jun 26 11:35:16 2011

Hex dump of Absolute File 4, Block 53784 in trace file /oracle/app/oracle/admin/cifs/

bdump/cifs_dbw0_225460.trc

***

Corrupt block relative dba: 0x0100d218 (file 4, block 53784)

Bad header found during preparing block for write

Data in bad block -

type: 6 format: 2 rdba: 0x04c05738

last change scn: 0x0001.42162bd7 seq: 0x2 flg: 0x04

consistency value in tail: 0x2bd70602

check value in block header: 0x3d2d, computed block checksum: 0x0

spare1: 0x0, spare2: 0x0, spare3: 0x0

***

Sun Jun 26 11:35:16 2011

Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:

ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []

Sun Jun 26 11:35:17 2011

Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc:

ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], []

DBW0: terminating instance due to error 600

Instance terminated by DBW0, pid = 225460

Sun Jun 26 13:06:07 2011

Starting ORACLE instance (normal)

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

SCN scheme 3

Using log_archive_dest parameter default value

LICENSE_MAX_USERS = 0

SYS auditing is disabled

Starting up ORACLE RDBMS Version: 9.2.0.8.0.

从上面的Bad header found during preparing block for write 信息可以看出，dbw0进程在写数据库的时候，
发现这个block header信息异常。我们知道 tail=scn后4位+flg+seq 根据这个规则来看，很明显信息不对。

单从这个ora-600错误来看，metalink也有相关的文档解释，总的来说就是bug，这里根据
16831000 为10进制，首先转换为16进制，然后再转换为 file 4 和 block 53784，即得到上面的坏块号。
metalink 文档，大家可以参考ORA-600 [kcbzpb_1] [ID 139011.1] 其中提到的

bug 6401576 9.2.0.8.P22 OERI[ktbair1] / ORA-600 [6101] index corruption possible

跟我这里遇到的情况基本上一致，对于这种逻辑错误，根据oracle的解释，startup实例即可消除问题。

CASE 2.

某客户的其中一套10203 rac(asm)的某个节点出现ora-600错误，alert信息如下：

Fri Jun 24 09:06:49 2011
Thread 1 advanced to log sequence 7331
Current log# 2 seq# 7331 mem# 0: +DG_DATA/cmsdb1/onlinelog/redo02_1.log
Current log# 2 seq# 7331 mem# 1: +DG_DATA/cmsdb1/onlinelog/redo02_2.log
Fri Jun 24 10:32:58 2011
Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_2083.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Jun 24 10:33:05 2011
Trace dumping is performing id=[cdmp_20110624103305]
Fri Jun 24 10:33:18 2011
###### 省略部分内容 ######
Fri Jun 24 15:09:28 2011
Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_8188.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Fri Jun 24 15:09:29 2011
Trace dumping is performing id=[cdmp_20110624150929]

Fri Jun 24 09:06:49 2011

Thread 1 advanced to log sequence 7331

Current log# 2 seq# 7331 mem# 0: +DG_DATA/cmsdb1/onlinelog/redo02_1.log

Current log# 2 seq# 7331 mem# 1: +DG_DATA/cmsdb1/onlinelog/redo02_2.log

Fri Jun 24 10:32:58 2011

Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_2083.trc:

ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

Fri Jun 24 10:33:05 2011

Trace dumping is performing id=[cdmp_20110624103305]

Fri Jun 24 10:33:18 2011

###### 省略部分内容 ######

Fri Jun 24 15:09:28 2011

Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_8188.trc:

ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

Fri Jun 24 15:09:29 2011

Trace dumping is performing id=[cdmp_20110624150929]

从其中的某个trace文件，我发现了如下内容：

*** 2011-06-24 10:33:20.755
*** SERVICE NAME:(cmsdb1) 2011-06-24 10:33:20.731
*** SESSION ID:(173.17308) 2011-06-24 10:33:20.731
row 02099eae.1 continuation at
file# 8 block# 630446 slot 2 not found
**************************************************
KDSTABN_GET: 0 ….. ntab: 1
curSlot: 2 ….. nrows: 8
**************************************************
*** 2011-06-24 10:33:20.755
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
SELECT m.url,p.pubtime  FROM mastersite m, pagelayout p  WHERE m.nodeid = p.nodeid and m.articleid=p.articleid  AND p.pubtime BETWEEN SYSDATE – (:”SYS_B_0″ / :”SYS_B_1″) AND SYSDATE  AND m.url IS NOT NULL  AND m.url not like :”SYS_B_2″  order by p.pubtime
—– Call Stack Trace —–
calling              call     entry                argument values in hex     
location             type     point                (? means dubious value)    
——————– ——– ——————– —————————-
ksedmp()+744         CALL     ksedst()             000000840 ?
FFFFFFFF7FFF915C ?
000000000 ?
FFFFFFFF7FFF5C50 ?
FFFFFFFF7FFF49B8 ?
FFFFFFFF7FFF53B8 ?
kgerinv()+200        PTR_CALL 0000000000000000     000106400 ? 10652C964 ?
10652C000 ? 00010652C ?
000106400 ? 10652C964 ?
kgeasnmierr()+28     CALL     kgerinv()            106527318 ? 000000000 ?
10610F130 ? 000000000 ?
FFFFFFFF7FFF9600 ?
000001430 ?
kdsgrp()+1252        CALL     kgeasnmierr()        106527318 ?
FFFFFFFF7CF26270 ?
10610F130 ? 000000000 ?
10610F000 ? 00010610F ?
kdsfbr()+240         CALL     kdsgrp()             FFFFFFFF7CF575E0 ?
000000000 ? 000106000 ?
10610F130 ? 000000000 ?
10652C958 ?
qertbFetchByRowID()  CALL     kdsfbr()             FFFFFFFF7CF575E0 ?
+996                                               FFFFFFFF7CF57CCC ?
FFFFFFFF7CF575E0 ?
000106529 ? 800137DC0 ?
106529DB0 ?
qergiFetch()+344     PTR_CALL 0000000000000000     1056DA000 ? 0000010E4 ?
000000000 ? 000000001 ?
FFFFFFFF7CF57548 ?
FFFFFFFF7CF57CC8 ?
qerjoFetch()+580     PTR_CALL 0000000000000000     47A4819D0 ? 000000000 ?
000000000 ?
FFFFFFFF7CF57738 ?
000000027 ? 000000800 ?
qertbFetchByRowID()  PTR_CALL 0000000000000000     000000000 ? 1031FF060 ?
+556                                               FFFFFFFF7CF57C60 ?
47A4815E0 ? 47A4813F8 ?
000008000 ?
rwsfcd()+120         PTR_CALL 0000000000000000     FFFFFFFF7CF5AC78 ?
000000040 ? 47A481140 ?
000007FFF ? 47A4810D0 ?
00000013C ?
qerflFetch()+172     PTR_CALL 0000000000000000     47A480B30 ? 10320D5A0 ?
FFFFFFFF7FFFA908 ?
000007FFF ? 0000003C0 ?
105A287B0 ?
qersoFetch()+720     PTR_CALL 0000000000000000     FFFFFFFF7CF5C260 ?
103291C80 ? 47A4806E0 ?
000007FFF ? 10320D000 ?
103290B40 ?
opifch2()+8092       CALL     kpofrws()            105A287B0 ? 1024D8940 ?
105A28000 ?
FFFFFFFF7CF5C278 ?
000000020 ? 47A4809B8 ?
kpoal8()+4196        CALL     opifch2()            000000020 ?
FFFFFFFF7CF3A830 ?
000000000 ? 000106400 ?
000000001 ? 47A4861E8 ?
opiodr()+1548        PTR_CALL 0000000000000000     000000020 ? 000000000 ?
000105800 ? 000000004 ?
000000000 ? 106534460 ?
ttcpip()+1284        PTR_CALL 0000000000000000     10576A380 ? 00000005E ?
106527200 ? 000000001 ?
FFFFFFFF7CF3A830 ?
000106529 ?
opitsk()+1432        CALL     ttcpip()             0000001AF ?
FFFFFFFF7FFFD220 ?
1056C06EC ? 1056BDED0 ?
000000000 ? 106527318 ?
opiino()+1128        CALL     opitsk()             106534468 ? 000000001 ?
000000000 ? 106534460 ?
105884B38 ? 0FFFFFFFD ?
opiodr()+1548        PTR_CALL 0000000000000000     000106400 ? 106529D98 ?
000106400 ? 106529000 ?
000106400 ? 106534460 ?
opidrv()+896         CALL     opiodr()             1065335D8 ? 00000003C ?
000106400 ? 1065343E0 ?
000106534 ? 000106529 ?
sou2o()+80           CALL     opidrv()             106536B60 ? 000000000 ?
00000003C ? 106533898 ?
00000003C ? 000000000 ?
opimai_real()+124    CALL     sou2o()              FFFFFFFF7FFFF978 ?
00000003C ? 000000004 ?
FFFFFFFF7FFFF9A0 ?
105E0E000 ? 000105E0E ?
main()+152           CALL     opimai_real()        000000002 ?
FFFFFFFF7FFFFA78 ?
104051BEC ? 1064CF450 ?
00247D764 ? 000014800 ?
_start()+380         CALL     main()               000000002 ? 000000008 ?
000000000 ?
FFFFFFFF7FFFFA88 ?
FFFFFFFF7FFFFB98 ?
FFFFFFFF7D500200 ?

100

101

102

103

104

105

106

107

108

109

110

111

112

*** 2011-06-24 10:33:20.755

*** SERVICE NAME:(cmsdb1) 2011-06-24 10:33:20.731

*** SESSION ID:(173.17308) 2011-06-24 10:33:20.731

row 02099eae.1 continuation at

file# 8 block# 630446 slot 2 not found

**************************************************

KDSTABN_GET: 0 ….. ntab: 1

curSlot: 2 ….. nrows: 8

**************************************************

*** 2011-06-24 10:33:20.755

ksedmp: internal or fatal error

ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

Current SQL statement for this session:

SELECT m.url,p.pubtime FROM mastersite m, pagelayout p WHERE m.nodeid = p.nodeid and m.articleid=p.articleid AND p.pubtime BETWEEN SYSDATE – (:”SYS_B_0″ / :”SYS_B_1″) AND SYSDATE AND m.url IS NOT NULL AND m.url not like :”SYS_B_2″ order by p.pubtime

—– Call Stack Trace —–

calling call entry argument values in hex

location type point (? means dubious value)

——————– ——– ——————– —————————-

ksedmp()+744 CALL ksedst() 000000840 ?

FFFFFFFF7FFF915C ?

000000000 ?

FFFFFFFF7FFF5C50 ?

FFFFFFFF7FFF49B8 ?

FFFFFFFF7FFF53B8 ?

kgerinv()+200 PTR_CALL 0000000000000000 000106400 ? 10652C964 ?

10652C000 ? 00010652C ?

000106400 ? 10652C964 ?

kgeasnmierr()+28 CALL kgerinv() 106527318 ? 000000000 ?

10610F130 ? 000000000 ?

FFFFFFFF7FFF9600 ?

000001430 ?

kdsgrp()+1252 CALL kgeasnmierr() 106527318 ?

FFFFFFFF7CF26270 ?

10610F130 ? 000000000 ?

10610F000 ? 00010610F ?

kdsfbr()+240 CALL kdsgrp() FFFFFFFF7CF575E0 ?

000000000 ? 000106000 ?

10610F130 ? 000000000 ?

10652C958 ?

qertbFetchByRowID() CALL kdsfbr() FFFFFFFF7CF575E0 ?

+996 FFFFFFFF7CF57CCC ?

FFFFFFFF7CF575E0 ?

000106529 ? 800137DC0 ?

106529DB0 ?

qergiFetch()+344 PTR_CALL 0000000000000000 1056DA000 ? 0000010E4 ?

000000000 ? 000000001 ?

FFFFFFFF7CF57548 ?

FFFFFFFF7CF57CC8 ?

qerjoFetch()+580 PTR_CALL 0000000000000000 47A4819D0 ? 000000000 ?

000000000 ?

FFFFFFFF7CF57738 ?

000000027 ? 000000800 ?

qertbFetchByRowID() PTR_CALL 0000000000000000 000000000 ? 1031FF060 ?

+556 FFFFFFFF7CF57C60 ?

47A4815E0 ? 47A4813F8 ?

000008000 ?

rwsfcd()+120 PTR_CALL 0000000000000000 FFFFFFFF7CF5AC78 ?

000000040 ? 47A481140 ?

000007FFF ? 47A4810D0 ?

00000013C ?

qerflFetch()+172 PTR_CALL 0000000000000000 47A480B30 ? 10320D5A0 ?

FFFFFFFF7FFFA908 ?

000007FFF ? 0000003C0 ?

105A287B0 ?

qersoFetch()+720 PTR_CALL 0000000000000000 FFFFFFFF7CF5C260 ?

103291C80 ? 47A4806E0 ?

000007FFF ? 10320D000 ?

103290B40 ?

opifch2()+8092 CALL kpofrws() 105A287B0 ? 1024D8940 ?

105A28000 ?

FFFFFFFF7CF5C278 ?

000000020 ? 47A4809B8 ?

kpoal8()+4196 CALL opifch2() 000000020 ?

FFFFFFFF7CF3A830 ?

000000000 ? 000106400 ?

000000001 ? 47A4861E8 ?

opiodr()+1548 PTR_CALL 0000000000000000 000000020 ? 000000000 ?

000105800 ? 000000004 ?

000000000 ? 106534460 ?

ttcpip()+1284 PTR_CALL 0000000000000000 10576A380 ? 00000005E ?

106527200 ? 000000001 ?

FFFFFFFF7CF3A830 ?

000106529 ?

opitsk()+1432 CALL ttcpip() 0000001AF ?

FFFFFFFF7FFFD220 ?

1056C06EC ? 1056BDED0 ?

000000000 ? 106527318 ?

opiino()+1128 CALL opitsk() 106534468 ? 000000001 ?

000000000 ? 106534460 ?

105884B38 ? 0FFFFFFFD ?

opiodr()+1548 PTR_CALL 0000000000000000 000106400 ? 106529D98 ?

000106400 ? 106529000 ?

000106400 ? 106534460 ?

opidrv()+896 CALL opiodr() 1065335D8 ? 00000003C ?

000106400 ? 1065343E0 ?

000106534 ? 000106529 ?

sou2o()+80 CALL opidrv() 106536B60 ? 000000000 ?

00000003C ? 106533898 ?

00000003C ? 000000000 ?

opimai_real()+124 CALL sou2o() FFFFFFFF7FFFF978 ?

00000003C ? 000000004 ?

FFFFFFFF7FFFF9A0 ?

105E0E000 ? 000105E0E ?

main()+152 CALL opimai_real() 000000002 ?

FFFFFFFF7FFFFA78 ?

104051BEC ? 1064CF450 ?

00247D764 ? 000014800 ?

_start()+380 CALL main() 000000002 ? 000000008 ?

000000000 ?

FFFFFFFF7FFFFA88 ?

FFFFFFFF7FFFFB98 ?

FFFFFFFF7D500200 ?

其中找了两篇比较有价值而且比较靠谱的文章，如下：

Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1]
After Upgrade To 10203 ,Select * From Gv$Aq Returns Ora-00600[kdsgrp1] Or [Qertbfetchbyrowid] [ID 603095.1]

最开始看ORA-600 [kdsgrp1] [ID 285586.1] 这篇文档的时候，我感觉这个问题很可能就是bug，然而看了上面2篇文章以后，
感觉不见得一定就是bug，另外还看了好几篇文档，都是与index相关或CR相关的，所以我断定这个问题是index和table之间
的数据不一致导致的。

其中Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1] 这篇是最有价值的，提供了一个解决思路：

首先根据ora-600错误提到的file和block号，定位到具体的对象：

SELECT tablespace_name, segment_type, owner, segment_name
FROM dba_extents
WHERE file_id = 8
and 630446 between block_id AND block_id + blocks - 1;

SELECT tablespace_name, segment_type, owner, segment_name

FROM dba_extents

WHERE file_id = 8

and 630446 between block_id AND block_id + blocks - 1;

然后从trace中搜索 Plan Table，找到执行计划，根据前面查到的对象，然后确定到具体的index名称。

============
Plan Table
============
------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                   | Rows  | Bytes | Cost  | Time      | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------
| 0   | SELECT STATEMENT                        |                        |       |       |  1230 |           |       |       |
| 1   |  COUNT STOPKEY                          |                        |       |       |       |           |       |       |
| 2   |   VIEW                                  |                        |   113 |  867K |  1230 |  00:00:15 |       |       |
| 3   |    SORT ORDER BY STOPKEY                |                        |   113 |  155K |  1230 |  00:00:15 |       |       |
| 4   |     NESTED LOOPS OUTER                  |                        |   113 |  155K |  1229 |  00:00:15 |       |       |
| 5   |      PARTITION RANGE SINGLE             |                        |   113 |  149K |  1228 |  00:00:15 | KEY   | KEY   |
| 6   |       TABLE ACCESS BY LOCAL INDEX ROWID | PAGELAYOUT             |   113 |  149K |  1228 |  00:00:15 | KEY   | KEY   |
| 7   |        INDEX RANGE SCAN                 | IDX_PAGELAYOUT_T_NODEID|   214 |       |     7 |  00:00:01 | KEY   | KEY   |
| 8   |      TABLE ACCESS BY INDEX ROWID        | DATASNIPARTICLE        |     1 |    57 |     1 |  00:00:01 |       |       |
| 9   |       INDEX UNIQUE SCAN                 | SYS_C0034918           |     1 |       |     0 |           |       |       |
------------------------------------------------------------------------------------------------------------------------------

============

Plan Table

============

------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | | | 1230 | | | |

| 1 | COUNT STOPKEY | | | | | | | |

| 2 | VIEW | | 113 | 867K | 1230 | 00:00:15 | | |

| 3 | SORT ORDER BY STOPKEY | | 113 | 155K | 1230 | 00:00:15 | | |

| 4 | NESTED LOOPS OUTER | | 113 | 155K | 1229 | 00:00:15 | | |

| 5 | PARTITION RANGE SINGLE | | 113 | 149K | 1228 | 00:00:15 | KEY | KEY |

| 6 | TABLE ACCESS BY LOCAL INDEX ROWID | PAGELAYOUT | 113 | 149K | 1228 | 00:00:15 | KEY | KEY |

| 7 | INDEX RANGE SCAN | IDX_PAGELAYOUT_T_NODEID| 214 | | 7 | 00:00:01 | KEY | KEY |

| 8 | TABLE ACCESS BY INDEX ROWID | DATASNIPARTICLE | 1 | 57 | 1 | 00:00:01 | | |

| 9 | INDEX UNIQUE SCAN | SYS_C0034918 | 1 | | 0 | | | |

------------------------------------------------------------------------------------------------------------------------------

最后drop和recreate该index即可解决该问题。
补充：从别的trace中，还发现了update语句出现该错误，如下：

*** 2011-06-24 10:33:18.007
ksedmp: internal or fatal error
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Current SQL statement for this session:
update pagelayout set publishstate=:1 where articleid = :2 and (masterid = :3 or nodeid = :4)

*** 2011-06-24 10:33:18.007

ksedmp: internal or fatal error

ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []

Current SQL statement for this session:

update pagelayout set publishstate=:1 where articleid = :2 and (masterid = :3 or nodeid = :4)

从而我确定是表pagelayout上的某个index出现问题。

CASE 3.

上周出差到吉林某客户，巡检时发现某套rac的awr居然没有生成，而且手工创建快照也没反应。总之说来，
比较怪异，而且还发现某个节点的/oracle 一共20g，oracle本身才8G左右，另外的12g不知道被什么东西
占据了，ls -ltr居然无法看见，经主机工程师检查，发现是diag进程占据了12g的磁盘空间。

在操作之前，我做了一个systemstate dump，对该dump我简单的分析了一下，发现了如下有用信息：

System global information:
processes: base c0000000ad649960, size 500, cleanup c0000000ad68c020
allocation: free sessions c0000000ad78d620, free calls 0000000000000000
control alloc errors: 0 (process), 0 (session), 0 (call)
PMON latch cleanup depth: 0
seconds since PMON's last scan for dead processes: 19  ###### 有19个死进程
###### 我们首先来搜索diag进程：######
PROCESS 3:
—————————————-
SO: c0000000ad68c810, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=3, calls cur/top: c0000000adab8358/c0000000adab8358, flag: (2) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 24
last post received-location: ksasnd
last process to post me: c0000000ad6956f0 235 0
last post sent: 0 0 48
last post sent-location: ksoreq_reply
last process posted by me: c0000000ad782910 1 0
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27628
OSD pid info: Unix process pid: 27628, image: oracle@uamdb2 (DIAG)
###### 然后搜索c0000000ad68c810 字符串，发现如下内容：######
SO: c0000000ab2bac00, type: 19, owner: c0000000ad68c810, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=0
outq=0 sndq=0 opid=0 prmb=0x0
mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=0 mop[q]=0 pendq=0 zmbq=0
nonksxp_recvs=0
————process 0xc0000000ab2bac00——————–
proc version      : 0
Local node        : 1
pid               : 27628
lkp_node          : 1
svr_mode          : 0
proc state        : KJP_UNFREEZE
Last drm hb acked : 0
Total accesses    : 5
Imm.  accesses    : 0
Locks on ASTQ     : 0
Locks Pending AST : 0
Granted locks     : 0
AST_Q:
PENDING_Q:
GRANTED_Q:
—————————————-
SO: c0000000ada72528, type: 4, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(session) sid: 554 trans: 0000000000000000, creator: c0000000ad68c810, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'DIAG idle wait' blocking sess=0x0000000000000000 seq=331 wait_time=0 seconds since wait started=62193
component=1, where=1, wait time(millisec)=c8
Dumping Session Wait History
for 'DIAG idle wait' count=1 wait_time=195274
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195286
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=214760
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195267
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195282
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195282
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195281
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195278
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195280
component=1, where=1, wait time(millisec)=c8
for 'DIAG idle wait' count=1 wait_time=195270
component=1, where=1, wait time(millisec)=c8
temporary object counter: 0
—————————————-
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log c0000000acb7a5a0
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000acb7a5e8
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000acb7a630
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 554 efd 0 whr 0 slp 0
—————————————-
SO: c0000000a1a7fa18, type: 41, owner: c0000000ada72528, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0  
—————————————-
SO: c0000000adab8358, type: 3, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(call) sess: cur c0000000ada72528, rec 0, usr c0000000ada72528; depth: 0
—————————————-
SO: c0000000ab2adec8, type: 16, owner: c0000000ad68c810, flag: INIT/-/-/0x00
(osp req holder)
###### 然后来看mmon进程：######
SO: c0000000add59f60, type: 11, owner: c0000000ad692f40, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad692f40,
event: 1, last message event: 1381841,
last message waited event: 1381841, messages read: 1376494
channel: (c0000000add81260) MMON remote action broadcast channel
scope: 27, event: 1381841, last mesage event: 1381841,
publishers/subscribers: 2/2,
messages published: 1376494
—————————————-
SO: c0000000ada60ee0, type: 4, owner: c0000000ad692f40, flag: INIT/-/-/0x00
(session) sid: 541 trans: 0000000000000000, creator: c0000000ad692f40, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0010-00000003, short-term DID: 0002-0010-00000004
txn branch: 0000000000000000
oct: 0, prv: 0, sql: c0000000ac7f3760, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=49622 wait_time=0 seconds since wait started=2
timeout=12c, =0, =0
Dumping Session Wait History
for 'rdbms ipc message' count=1 wait_time=1503849
timeout=99, =0, =0
for 'rdbms ipc message' count=1 wait_time=1435494
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2939422
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=517538
timeout=34, =0, =0
for 'rdbms ipc message' count=1 wait_time=2421822
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2470581
timeout=fc, =0, =0
for 'rdbms ipc message' count=1 wait_time=1
timeout=fc, =0, =0
for 'rdbms ipc message' count=1 wait_time=468661
timeout=12c, =0, =0
for 'rdbms ipc message' count=1 wait_time=2933950
timeout=12c, =0, =0
for 'os thread startup' count=1 wait_time=41005
=0, =0, =0
另外看了几个其他进程的信息，发现都是在等待 rdbms ipc message。
记得当时还做了一个oradebug hanganalyze 3，发现lck0进程居然阻塞了451个objects。
由于当时的产生的trace我忘记取下来了, 所以就只能用这个trace中去寻找信息了。
###### 搜索lck字符串：######
PROCESS 20:
—————————————-
SO: c0000000ad694f00, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=20, calls cur/top: c0000000adabb810/c0000000adabb810, flag: (6) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 33
last post received-location: ksrpublish
last process to post me: c0000000ad694f00 1 6
last post sent: 0 0 33
last post sent-location: ksrpublish
last process posted by me: c0000000ad694f00 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27684
OSD pid info: Unix process pid: 27684, image: oracle@uamdb2 (LCK0)
###### 然后搜索字符串c0000000ad694f00 ######
SO: c0000000adbb9fd0, type: 5, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(enqueue) XR-00000000-00000000 DID: 0000-0014-00000003
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  res_flag: 0x3
res: 0xc0000000adcbf510, mode: NULL, lock_flag: 0x0
own: 0xc0000000ada5b940, sess: 0xc0000000ada5b940, proc: 0xc0000000ad694f00, prv: 0xc0000000adcbf520
slk: 0xc0000000ab40db98
这段内容是enqueue state object信息，发现该进程持有一个enqueue。
从这里的 res_flag: 0x3 可以判断：这个enqueue是global类型的，而且正在被持有，还没被free。
继续往下搜索，发现了如下类似的大量信息：
SO: c0000000add5e1e8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 3, last message event: 3,
last message waited event: 3, messages read: 0
channel: (c0000000add81108) kea interrupt signal channel
scope: 5, event: 3, last mesage event: 0,
publishers/subscribers: 1/2,
messages published: 0
—————————————-
SO: c0000000add5e0d8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 3, last message event: 3326313,
last message waited event: 3326313, messages read: 3324562
channel: (c0000000add7ee18) kxfp control signal channel
scope: 5, event: 3326313, last mesage event: 3326313,
publishers/subscribers: 4/2,
messages published: 3324562
这里的type值为11，代表的意思是：Broadcast Handle(即广播消息句柄）。难怪做oradebug的时候，发现这个lck进程
阻塞了451个objects，而且等待居然都是rdbms ipc message。
继续往下搜索，发现了如下关键信息：
—————————————-
SO: c0000000ada5b940, type: 4, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(session) sid: 537 trans: 0000000000000000, creator: c0000000ad694f00, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0000-0014-00000003, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=43421 wait_time=0 seconds since wait started=1
timeout=12c, =0, =0
Dumping Session Wait History
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=1906819
timeout=c2, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=9
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=12
timeout=c2, =0, =0
for 'rdbms ipc message' count=1 wait_time=1041870
timeout=12c, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=7
=0, =0, =0
for 'ksxr poll remote instances' count=1 wait_time=10
=0, =0, =0
for 'rdbms ipc message' count=1 wait_time=807692
timeout=52, =0, =0
for 'rdbms ipc message' count=1 wait_time=27
timeout=52, =0, =0
temporary object counter: 0
—————————————-
UOL used : 0 locks(used=0, free=0)
KGX Atomic Operation Log c0000000aca2e6e8
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Cursor Pin uid 537 efd 3 whr 11 slp 0
KGX Atomic Operation Log c0000000aca2e730
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 537 efd 0 whr 0 slp 0
KGX Atomic Operation Log c0000000aca2e778
Mutex 0000000000000000(0, 0) idn 0 oper NONE
Library Cache uid 537 efd 0 whr 0 slp 0
—————————————-
SO: c0000000a1a7fe78, type: 41, owner: c0000000ada5b940, flag: INIT/-/-/0x00
(dummy) nxc=0, nlb=0  
—————————————-
SO: c0000000add5ab28, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00,
event: 20, last message event: 26,
last message waited event: 26, messages read: 1
channel: (c0000000add7a2d8) system events broadcast channel
scope: 2, event: 204545, last mesage event: 26,
publishers/subscribers: 0/75,
messages published: 1
—————————————-
SO: c0000000ab2ba708, type: 19, owner: c0000000ad694f00, flag: INIT/-/-/0x00
GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=36
outq=0 sndq=0 opid=20 prmb=0x0
mbg[i]=(3277535 648561) mbg[b]=(0 0) mbg[r]=(0 0)
fmq[i]=(4 0) fmq[b]=(0 0) fmq[r]=(0 0)
mop[s]=648561 mop[q]=3277535 pendq=0 zmbq=0
nonksxp_recvs=0
————process 0xc0000000ab2ba708——————–
proc version      : 0
Local node        : 1
pid               : 27684
lkp_node          : 1
svr_mode          : 0
proc state        : KJP_NORMAL
Last drm hb acked : 0
Total accesses    : 221605743
Imm.  accesses    : 196696635
Locks on ASTQ     : 0
Locks Pending AST : 113
Granted locks     : 56081
AST_Q:
PENDING_Q:
lp 0xc0000000acb505c0 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x9565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK 
lp 0xc00000007a220300 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK 
lp 0xc000000074ac4c38 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK 
lp 0xc0000000748ca940 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV]
master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a
这里的rp 0xc000000094eeda88 我猜应该是enqueue等待的资源名，于是搜索resouce 0xc000000094eeda88字符串，发现了如下信息：
———-enqueue 0xc00000007a3cb2e0————————
lock version     : 1
Owner node       : 1
grant_level      : KJUSERNL
req_level        : KJUSERPR
bast_level       : KJUSERNL
notify_func      : 0x0000000000000000
resp             : 0xc000000094eeda88
procp            : 0xc0000000ab2ba708
pid              : 27684
proc version     : 0
oprocp           : 0x0000000000000000
opid             : 0
group lock owner : 0x0000000000000000
xid              : 0000-0000-00000000
dd_time          : 0.0 secs
dd_count         : 0
timeout          : 0.0 secs
On_timer_q?      : N
On_dd_q?         : N
lock_state       : CONVERTING
Open Options     :  KJUSERPROCESS_OWNED KJUSERCLIENTLOCK
Convert options  : KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK
History          : 0x565a565a
Msg_Seq          : 0x0
res_seq          : 1
valblk           : 0x000000020001fcc00000000000000000 .
———-resource 0xc000000094eeda88———————-
resname       : [0x2345][0xffffffff],[IV]
Local node    : 1
dir_node      : 1
master_node   : 1
hv idx        : 35
hv last r.inc : 6
current inc   : 36
hv status     : 0
hv master     : 1
open options  :
grant_bits    : KJUSERNL KJUSERPR
grant mode    : KJUSERNL  KJUSERCR  KJUSERCW  KJUSERPR  KJUSERPW  KJUSEREX
count         : 114         0         0         333         0         0
val_state     : KJUSERVS_VALUE
valblk        : 0x000000020001fcc00000000000000000 .
access_node   : 1
vbreq_state   : 0
state         : x8
resp          : 0xc000000094eeda88
On Scan_q?    : N
Total accesses: 4524902
Imm.  accesses: 1948489
Granted_locks : 333
Cvting_locks  : 114
总的来说，给人的感觉似乎lck是罪魁祸首，其实不竟然。 
用awk脚本格式化了这个systemstate dump后，得到下面的结果：
System State 1
~~~~~~~~~~~~~~~~
1:                                     
2:  waiting for 'pmon timer'            wait
3:  waiting for 'DIAG idle wait'        wait
4:  waiting for 'rdbms ipc message'     wait
5:  waiting for 'rdbms ipc message'     wait
6:  waiting for 'ges remote message'    wait
7:  waiting for 'gcs remote message'    wait
8:  waiting for 'gcs remote message'    wait
9:  waiting for 'rdbms ipc message'     wait
10: waiting for 'rdbms ipc message'     wait
11: waiting for 'rdbms ipc message'     wait
12: waiting for 'rdbms ipc message'     wait
13: waiting for 'smon timer'            wait
14: waiting for 'rdbms ipc message'     wait
15: waiting for 'rdbms ipc message'     wait
16: waiting for 'rdbms ipc message'     wait
17: waiting for 'rdbms ipc message'     wait
18:                                    
19:                                    
20: waiting for 'rdbms ipc message'     wait
21: waiting for 'SQL*Net message from client' wait
22: waiting for 'rdbms ipc message'     wait
23: waiting for 'rdbms ipc message'     wait
24: waiting for 'SQL*Net message from client' wait
25: waiting for 'Streams AQ: qmn coordinator idle wait' wait
26: waiting for 'SQL*Net message from client' wait
27: waiting for 'SQL*Net message from client' wait
Cmd: PL/SQL Execute
28: for 'Streams AQ: waiting for time management or cleanup tasks' wait
29:                                    
30: waiting for 'SQL*Net message from client' wait
31: waiting for 'SQL*Net message from client' wait
32: waiting for 'SQL*Net message from client' wait
33: waiting for 'SQL*Net message from client' wait
34: waiting for 'SQL*Net message from client' wait
35: waiting for 'SQL*Net message from client' wait
36: waiting for 'SQL*Net message from client' wait
37: waiting for 'SQL*Net message from client' wait
38: waiting for 'SQL*Net message from client' wait
39: waiting for 'SQL*Net message from client' wait
Cmd: Insert
40: waiting for 'SQL*Net message from client' wait
41: waiting for 'SQL*Net message from client' wait
42: last wait for 'SQL*Net message from client'
43: waiting for 'SQL*Net message from client' wait
45: waiting for 'Streams AQ: qmn slave idle wait' wait
46: waiting for 'SQL*Net message from client' wait
Cmd: Insert
47: waiting for 'SQL*Net message from client' wait
48: waiting for 'SQL*Net message from client' wait
49: waiting for 'SQL*Net message from client' wait
50: waiting for 'SQL*Net message from client' wait
52: waiting for 'SQL*Net message from client' wait
53: waiting for 'SQL*Net message from client' wait
54: waiting for 'SQL*Net message from client' wait
55: waiting for 'SQL*Net message from client' wait
56: waiting for 'SQL*Net message from client' wait
58: waiting for 'SQL*Net message from client' wait
59: waiting for 'SQL*Net message from client' wait
60: waiting for 'SQL*Net message from client' wait
61: waiting for 'SQL*Net message from client' wait
65: waiting for 'SQL*Net message from client' wait
66: waiting for 'SQL*Net message from client' wait
90: waiting for 'SQL*Net message from client' wait
91: waiting for 'SQL*Net message from client' wait
92: waiting for 'SQL*Net message from client' wait
93: waiting for 'SQL*Net message from client' wait
94: waiting for 'SQL*Net message from client' wait
95: waiting for 'SQL*Net message from client' wait
96: waiting for 'SQL*Net message from client' wait
106:waiting for 'SQL*Net message from client' wait
107:waiting for 'SQL*Net message from client' wait
108:waiting for 'SQL*Net message from client' wait
109:waiting for 'SQL*Net message from client' wait
110:waiting for 'SQL*Net message from client' wait
111:waiting for 'SQL*Net message from client' wait
112:waiting for 'SQL*Net message from client' wait
113:waiting for 'SQL*Net message from client' wait
114:waiting for 'SQL*Net message from client' wait
115:waiting for 'SQL*Net message from client' wait
116:waiting for 'SQL*Net message from client' wait
117:waiting for 'SQL*Net message from client' wait
120:waiting for 'SQL*Net message from client' wait
NO BLOCKING PROCESSES FOUND
881983 Lines Processed.
从分析结果来看，断定pmon进程才是真正的罪魁祸首，搜索pmon关键字:
PROCESS 2:
—————————————-
SO: c0000000ad68c020, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: c0000000adab8098/c0000000adab8098, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 48
last post received-location: ksoreq_reply
last process to post me: c0000000ad692750 4 2
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c0000000ad68d000 1 6
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: c0000000ad784f20
O/S info: user: oracle, term: UNKNOWN, ospid: 27626
OSD pid info: Unix process pid: 27626, image: oracle@uamdb2 (PMON)
###### 省略部分信息 ######
—————————————-
SO: c0000000add586a8, type: 11, owner: c0000000ad68c020, flag: INIT/-/-/0x00
(broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad68c020,
event: 1, last message event: 1,
last message waited event: 1, messages read: 0
channel: (c0000000add7e4b0) scumnt mount lock
scope: 1, event: 15, last mesage event: 0,
publishers/subscribers: 0/14,
messages published: 0
—————————————-
SO: c0000000ada73a90, type: 4, owner: c0000000ad68c020, flag: INIT/-/-/0x00
(session) sid: 555 trans: 0000000000000000, creator: c0000000ad68c020, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0002-0002-00000004, short-term DID: 0000-0000-00000000
txn branch: 0000000000000000
oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS
service name: SYS$BACKGROUND
waiting for 'pmon timer' blocking sess=0x0000000000000000 seq=50 wait_time=0 seconds since wait started=75789
duration=12c, =0, =0
Dumping Session Wait History
for 'pmon timer' count=1 wait_time=2929152
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929462
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929351
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929175
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2928083
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929290
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929509
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929262
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929283
duration=12c, =0, =0
for 'pmon timer' count=1 wait_time=2929525
duration=12c, =0, =0
temporary object counter: 0

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

400

401

402

403

404

405

406

407

408

409

410

411

412

413

414

415

416

417

418

419

420

421

422

423

424

425

426

427

428

429

430

431

432

433

434

435

436

437

438

439

440

441

442

443

444

445

446

447

448

449

450

451

452

453

454

455

456

457

458

459

460

461

462

463

464

465

466

467

468

469

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486

487

488

489

490

491

492

493

494

System global information:

processes: base c0000000ad649960, size 500, cleanup c0000000ad68c020

allocation: free sessions c0000000ad78d620, free calls 0000000000000000

control alloc errors: 0 (process), 0 (session), 0 (call)

PMON latch cleanup depth: 0

seconds since PMON's last scan for dead processes: 19 ###### 有19个死进程

###### 我们首先来搜索diag进程：######

PROCESS 3:

—————————————-

SO: c0000000ad68c810, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00

(process) Oracle pid=3, calls cur/top: c0000000adab8358/c0000000adab8358, flag: (2) SYSTEM

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 0 0 24

last post received-location: ksasnd

last process to post me: c0000000ad6956f0 235 0

last post sent: 0 0 48

last post sent-location: ksoreq_reply

last process posted by me: c0000000ad782910 1 0

(latch info) wait_event=0 bits=0

Process Group: DEFAULT, pseudo proc: c0000000ad784f20

O/S info: user: oracle, term: UNKNOWN, ospid: 27628

OSD pid info: Unix process pid: 27628, image: oracle@uamdb2 (DIAG)

###### 然后搜索c0000000ad68c810 字符串，发现如下内容：######

SO: c0000000ab2bac00, type: 19, owner: c0000000ad68c810, flag: INIT/-/-/0x00

GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=0

outq=0 sndq=0 opid=0 prmb=0x0

mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0)

fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0)

mop[s]=0 mop[q]=0 pendq=0 zmbq=0

nonksxp_recvs=0

————process 0xc0000000ab2bac00——————–

proc version : 0

Local node : 1

pid : 27628

lkp_node : 1

svr_mode : 0

proc state : KJP_UNFREEZE

Last drm hb acked : 0

Total accesses : 5

Imm. accesses : 0

Locks on ASTQ : 0

Locks Pending AST : 0

Granted locks : 0

AST_Q:

PENDING_Q:

GRANTED_Q:

—————————————-

SO: c0000000ada72528, type: 4, owner: c0000000ad68c810, flag: INIT/-/-/0x00

(session) sid: 554 trans: 0000000000000000, creator: c0000000ad68c810, flag: (51) USR/- BSY/-/-/-/-/-

DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000

txn branch: 0000000000000000

oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS

service name: SYS$BACKGROUND

waiting for 'DIAG idle wait' blocking sess=0x0000000000000000 seq=331 wait_time=0 seconds since wait started=62193