最近一周遇到的几个case
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 最近一周遇到的几个case
最近一周遇到了好几个case,简单的总结一下,跟大家分享.
CASE 1.
某客户的一套9208的库,在跑批处理的时候,异常crash了,其中alert 信息如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
Thread 1 advanced to log sequence 51938 Current log# 2 seq# 51938 mem# 0: /oradata/cifs/redologfiles/redo21.log Current log# 2 seq# 51938 mem# 1: /oradata/cifs/redologfiles/redo22.log Sun Jun 26 11:35:16 2011 Hex dump of Absolute File 4, Block 53784 in trace file /oracle/app/oracle/admin/cifs/ bdump/cifs_dbw0_225460.trc *** Corrupt block relative dba: 0x0100d218 (file 4, block 53784) Bad header found during preparing block for write Data in bad block - type: 6 format: 2 rdba: 0x04c05738 last change scn: 0x0001.42162bd7 seq: 0x2 flg: 0x04 consistency value in tail: 0x2bd70602 check value in block header: 0x3d2d, computed block checksum: 0x0 spare1: 0x0, spare2: 0x0, spare3: 0x0 *** Sun Jun 26 11:35:16 2011 Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc: ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], [] Sun Jun 26 11:35:17 2011 Errors in file /oracle/app/oracle/admin/cifs/bdump/cifs_dbw0_225460.trc: ORA-00600: internal error code, arguments: [kcbzpb_1], [16831000], [4], [1], [], [], [], [] DBW0: terminating instance due to error 600 Instance terminated by DBW0, pid = 225460 Sun Jun 26 13:06:07 2011 Starting ORACLE instance (normal) LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 SCN scheme 3 Using log_archive_dest parameter default value LICENSE_MAX_USERS = 0 SYS auditing is disabled Starting up ORACLE RDBMS Version: 9.2.0.8.0. |
从上面的Bad header found during preparing block for write 信息可以看出,dbw0进程在写数据库的时候,
发现这个block header信息异常。我们知道 tail=scn后4位+flg+seq 根据这个规则来看,很明显信息不对。
单从这个ora-600错误来看,metalink也有相关的文档解释,总的来说就是bug,这里根据
16831000 为10进制,首先转换为16进制,然后再转换为 file 4 和 block 53784,即得到上面的坏块号。
metalink 文档,大家可以参考ORA-600 [kcbzpb_1] [ID 139011.1] 其中提到的
bug 6401576 9.2.0.8.P22 OERI[ktbair1] / ORA-600 [6101] index corruption possible
跟我这里遇到的情况基本上一致,对于这种逻辑错误,根据oracle的解释,startup实例即可消除问题。
CASE 2.
某客户的其中一套10203 rac(asm)的某个节点出现ora-600错误,alert信息如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
Fri Jun 24 09:06:49 2011 Thread 1 advanced to log sequence 7331 Current log# 2 seq# 7331 mem# 0: +DG_DATA/cmsdb1/onlinelog/redo02_1.log Current log# 2 seq# 7331 mem# 1: +DG_DATA/cmsdb1/onlinelog/redo02_2.log Fri Jun 24 10:32:58 2011 Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_2083.trc: ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [] Fri Jun 24 10:33:05 2011 Trace dumping is performing id=[cdmp_20110624103305] Fri Jun 24 10:33:18 2011 ###### 省略部分内容 ###### Fri Jun 24 15:09:28 2011 Errors in file /oracle/admin/cmsdb1/udump/cmsdb11_ora_8188.trc: ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [] Fri Jun 24 15:09:29 2011 Trace dumping is performing id=[cdmp_20110624150929] |
从其中的某个trace文件,我发现了如下内容:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 |
*** 2011-06-24 10:33:20.755 *** SERVICE NAME:(cmsdb1) 2011-06-24 10:33:20.731 *** SESSION ID:(173.17308) 2011-06-24 10:33:20.731 row 02099eae.1 continuation at file# 8 block# 630446 slot 2 not found ************************************************** KDSTABN_GET: 0 ….. ntab: 1 curSlot: 2 ….. nrows: 8 ************************************************** *** 2011-06-24 10:33:20.755 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [] Current SQL statement for this session: SELECT m.url,p.pubtime FROM mastersite m, pagelayout p WHERE m.nodeid = p.nodeid and m.articleid=p.articleid AND p.pubtime BETWEEN SYSDATE – (:”SYS_B_0″ / :”SYS_B_1″) AND SYSDATE AND m.url IS NOT NULL AND m.url not like :”SYS_B_2″ order by p.pubtime —– Call Stack Trace —– calling call entry argument values in hex location type point (? means dubious value) ——————– ——– ——————– —————————- ksedmp()+744 CALL ksedst() 000000840 ? FFFFFFFF7FFF915C ? 000000000 ? FFFFFFFF7FFF5C50 ? FFFFFFFF7FFF49B8 ? FFFFFFFF7FFF53B8 ? kgerinv()+200 PTR_CALL 0000000000000000 000106400 ? 10652C964 ? 10652C000 ? 00010652C ? 000106400 ? 10652C964 ? kgeasnmierr()+28 CALL kgerinv() 106527318 ? 000000000 ? 10610F130 ? 000000000 ? FFFFFFFF7FFF9600 ? 000001430 ? kdsgrp()+1252 CALL kgeasnmierr() 106527318 ? FFFFFFFF7CF26270 ? 10610F130 ? 000000000 ? 10610F000 ? 00010610F ? kdsfbr()+240 CALL kdsgrp() FFFFFFFF7CF575E0 ? 000000000 ? 000106000 ? 10610F130 ? 000000000 ? 10652C958 ? qertbFetchByRowID() CALL kdsfbr() FFFFFFFF7CF575E0 ? +996 FFFFFFFF7CF57CCC ? FFFFFFFF7CF575E0 ? 000106529 ? 800137DC0 ? 106529DB0 ? qergiFetch()+344 PTR_CALL 0000000000000000 1056DA000 ? 0000010E4 ? 000000000 ? 000000001 ? FFFFFFFF7CF57548 ? FFFFFFFF7CF57CC8 ? qerjoFetch()+580 PTR_CALL 0000000000000000 47A4819D0 ? 000000000 ? 000000000 ? FFFFFFFF7CF57738 ? 000000027 ? 000000800 ? qertbFetchByRowID() PTR_CALL 0000000000000000 000000000 ? 1031FF060 ? +556 FFFFFFFF7CF57C60 ? 47A4815E0 ? 47A4813F8 ? 000008000 ? rwsfcd()+120 PTR_CALL 0000000000000000 FFFFFFFF7CF5AC78 ? 000000040 ? 47A481140 ? 000007FFF ? 47A4810D0 ? 00000013C ? qerflFetch()+172 PTR_CALL 0000000000000000 47A480B30 ? 10320D5A0 ? FFFFFFFF7FFFA908 ? 000007FFF ? 0000003C0 ? 105A287B0 ? qersoFetch()+720 PTR_CALL 0000000000000000 FFFFFFFF7CF5C260 ? 103291C80 ? 47A4806E0 ? 000007FFF ? 10320D000 ? 103290B40 ? opifch2()+8092 CALL kpofrws() 105A287B0 ? 1024D8940 ? 105A28000 ? FFFFFFFF7CF5C278 ? 000000020 ? 47A4809B8 ? kpoal8()+4196 CALL opifch2() 000000020 ? FFFFFFFF7CF3A830 ? 000000000 ? 000106400 ? 000000001 ? 47A4861E8 ? opiodr()+1548 PTR_CALL 0000000000000000 000000020 ? 000000000 ? 000105800 ? 000000004 ? 000000000 ? 106534460 ? ttcpip()+1284 PTR_CALL 0000000000000000 10576A380 ? 00000005E ? 106527200 ? 000000001 ? FFFFFFFF7CF3A830 ? 000106529 ? opitsk()+1432 CALL ttcpip() 0000001AF ? FFFFFFFF7FFFD220 ? 1056C06EC ? 1056BDED0 ? 000000000 ? 106527318 ? opiino()+1128 CALL opitsk() 106534468 ? 000000001 ? 000000000 ? 106534460 ? 105884B38 ? 0FFFFFFFD ? opiodr()+1548 PTR_CALL 0000000000000000 000106400 ? 106529D98 ? 000106400 ? 106529000 ? 000106400 ? 106534460 ? opidrv()+896 CALL opiodr() 1065335D8 ? 00000003C ? 000106400 ? 1065343E0 ? 000106534 ? 000106529 ? sou2o()+80 CALL opidrv() 106536B60 ? 000000000 ? 00000003C ? 106533898 ? 00000003C ? 000000000 ? opimai_real()+124 CALL sou2o() FFFFFFFF7FFFF978 ? 00000003C ? 000000004 ? FFFFFFFF7FFFF9A0 ? 105E0E000 ? 000105E0E ? main()+152 CALL opimai_real() 000000002 ? FFFFFFFF7FFFFA78 ? 104051BEC ? 1064CF450 ? 00247D764 ? 000014800 ? _start()+380 CALL main() 000000002 ? 000000008 ? 000000000 ? FFFFFFFF7FFFFA88 ? FFFFFFFF7FFFFB98 ? FFFFFFFF7D500200 ? |
其中找了两篇比较有价值而且比较靠谱的文章,如下:
Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1]
After Upgrade To 10203 ,Select * From Gv$Aq Returns Ora-00600[kdsgrp1] Or [Qertbfetchbyrowid] [ID 603095.1]
最开始看ORA-600 [kdsgrp1] [ID 285586.1] 这篇文档的时候,我感觉这个问题很可能就是bug,然而看了上面2篇文章以后,
感觉不见得一定就是bug,另外还看了好几篇文档,都是与index相关或CR相关的,所以我断定这个问题是index和table之间
的数据不一致导致的。
其中Getting Ora-600 [kdsgrp1] On A Select [ID 429746.1] 这篇是最有价值的,提供了一个解决思路:
首先根据ora-600错误提到的file和block号,定位到具体的对象:
1 2 3 4 |
SELECT tablespace_name, segment_type, owner, segment_name FROM dba_extents WHERE file_id = 8 and 630446 between block_id AND block_id + blocks - 1; |
然后从trace中搜索 Plan Table,找到执行计划,根据前面查到的对象,然后确定到具体的index名称。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
============ Plan Table ============ ------------------------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost | Time | Pstart| Pstop | ------------------------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | | | 1230 | | | | | 1 | COUNT STOPKEY | | | | | | | | | 2 | VIEW | | 113 | 867K | 1230 | 00:00:15 | | | | 3 | SORT ORDER BY STOPKEY | | 113 | 155K | 1230 | 00:00:15 | | | | 4 | NESTED LOOPS OUTER | | 113 | 155K | 1229 | 00:00:15 | | | | 5 | PARTITION RANGE SINGLE | | 113 | 149K | 1228 | 00:00:15 | KEY | KEY | | 6 | TABLE ACCESS BY LOCAL INDEX ROWID | PAGELAYOUT | 113 | 149K | 1228 | 00:00:15 | KEY | KEY | | 7 | INDEX RANGE SCAN | IDX_PAGELAYOUT_T_NODEID| 214 | | 7 | 00:00:01 | KEY | KEY | | 8 | TABLE ACCESS BY INDEX ROWID | DATASNIPARTICLE | 1 | 57 | 1 | 00:00:01 | | | | 9 | INDEX UNIQUE SCAN | SYS_C0034918 | 1 | | 0 | | | | ------------------------------------------------------------------------------------------------------------------------------ |
最后drop和recreate该index即可解决该问题。
补充:从别的trace中,还发现了update语句出现该错误,如下:
1 2 3 4 5 |
*** 2011-06-24 10:33:18.007 ksedmp: internal or fatal error ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [] Current SQL statement for this session: update pagelayout set publishstate=:1 where articleid = :2 and (masterid = :3 or nodeid = :4) |
从而我确定是表pagelayout上的某个index出现问题。
CASE 3.
上周出差到吉林某客户,巡检时发现某套rac的awr居然没有生成,而且手工创建快照也没反应。总之说来,
比较怪异,而且还发现某个节点的/oracle 一共20g,oracle本身才8G左右,另外的12g不知道被什么东西
占据了,ls -ltr居然无法看见,经主机工程师检查,发现是diag进程占据了12g的磁盘空间。
在操作之前,我做了一个systemstate dump,对该dump我简单的分析了一下,发现了如下有用信息:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 |
System global information: processes: base c0000000ad649960, size 500, cleanup c0000000ad68c020 allocation: free sessions c0000000ad78d620, free calls 0000000000000000 control alloc errors: 0 (process), 0 (session), 0 (call) PMON latch cleanup depth: 0 seconds since PMON's last scan for dead processes: 19 ###### 有19个死进程 ###### 我们首先来搜索diag进程:###### PROCESS 3: —————————————- SO: c0000000ad68c810, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00 (process) Oracle pid=3, calls cur/top: c0000000adab8358/c0000000adab8358, flag: (2) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 24 last post received-location: ksasnd last process to post me: c0000000ad6956f0 235 0 last post sent: 0 0 48 last post sent-location: ksoreq_reply last process posted by me: c0000000ad782910 1 0 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: c0000000ad784f20 O/S info: user: oracle, term: UNKNOWN, ospid: 27628 OSD pid info: Unix process pid: 27628, image: oracle@uamdb2 (DIAG) ###### 然后搜索c0000000ad68c810 字符串,发现如下内容:###### SO: c0000000ab2bac00, type: 19, owner: c0000000ad68c810, flag: INIT/-/-/0x00 GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=0 outq=0 sndq=0 opid=0 prmb=0x0 mbg[i]=(0 0) mbg[b]=(0 0) mbg[r]=(0 0) fmq[i]=(0 0) fmq[b]=(0 0) fmq[r]=(0 0) mop[s]=0 mop[q]=0 pendq=0 zmbq=0 nonksxp_recvs=0 ————process 0xc0000000ab2bac00——————– proc version : 0 Local node : 1 pid : 27628 lkp_node : 1 svr_mode : 0 proc state : KJP_UNFREEZE Last drm hb acked : 0 Total accesses : 5 Imm. accesses : 0 Locks on ASTQ : 0 Locks Pending AST : 0 Granted locks : 0 AST_Q: PENDING_Q: GRANTED_Q: —————————————- SO: c0000000ada72528, type: 4, owner: c0000000ad68c810, flag: INIT/-/-/0x00 (session) sid: 554 trans: 0000000000000000, creator: c0000000ad68c810, flag: (51) USR/- BSY/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0000000000000000 oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS service name: SYS$BACKGROUND waiting for 'DIAG idle wait' blocking sess=0x0000000000000000 seq=331 wait_time=0 seconds since wait started=62193 component=1, where=1, wait time(millisec)=c8 Dumping Session Wait History for 'DIAG idle wait' count=1 wait_time=195274 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195286 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=214760 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195267 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195282 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195282 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195281 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195278 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195280 component=1, where=1, wait time(millisec)=c8 for 'DIAG idle wait' count=1 wait_time=195270 component=1, where=1, wait time(millisec)=c8 temporary object counter: 0 —————————————- UOL used : 0 locks(used=0, free=0) KGX Atomic Operation Log c0000000acb7a5a0 Mutex 0000000000000000(0, 0) idn 0 oper NONE Library Cache uid 554 efd 0 whr 0 slp 0 KGX Atomic Operation Log c0000000acb7a5e8 Mutex 0000000000000000(0, 0) idn 0 oper NONE Library Cache uid 554 efd 0 whr 0 slp 0 KGX Atomic Operation Log c0000000acb7a630 Mutex 0000000000000000(0, 0) idn 0 oper NONE Library Cache uid 554 efd 0 whr 0 slp 0 —————————————- SO: c0000000a1a7fa18, type: 41, owner: c0000000ada72528, flag: INIT/-/-/0x00 (dummy) nxc=0, nlb=0 —————————————- SO: c0000000adab8358, type: 3, owner: c0000000ad68c810, flag: INIT/-/-/0x00 (call) sess: cur c0000000ada72528, rec 0, usr c0000000ada72528; depth: 0 —————————————- SO: c0000000ab2adec8, type: 16, owner: c0000000ad68c810, flag: INIT/-/-/0x00 (osp req holder) ###### 然后来看mmon进程:###### SO: c0000000add59f60, type: 11, owner: c0000000ad692f40, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad692f40, event: 1, last message event: 1381841, last message waited event: 1381841, messages read: 1376494 channel: (c0000000add81260) MMON remote action broadcast channel scope: 27, event: 1381841, last mesage event: 1381841, publishers/subscribers: 2/2, messages published: 1376494 —————————————- SO: c0000000ada60ee0, type: 4, owner: c0000000ad692f40, flag: INIT/-/-/0x00 (session) sid: 541 trans: 0000000000000000, creator: c0000000ad692f40, flag: (51) USR/- BSY/-/-/-/-/- DID: 0002-0010-00000003, short-term DID: 0002-0010-00000004 txn branch: 0000000000000000 oct: 0, prv: 0, sql: c0000000ac7f3760, psql: 0000000000000000, user: 0/SYS service name: SYS$BACKGROUND waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=49622 wait_time=0 seconds since wait started=2 timeout=12c, =0, =0 Dumping Session Wait History for 'rdbms ipc message' count=1 wait_time=1503849 timeout=99, =0, =0 for 'rdbms ipc message' count=1 wait_time=1435494 timeout=12c, =0, =0 for 'rdbms ipc message' count=1 wait_time=2939422 timeout=12c, =0, =0 for 'rdbms ipc message' count=1 wait_time=517538 timeout=34, =0, =0 for 'rdbms ipc message' count=1 wait_time=2421822 timeout=12c, =0, =0 for 'rdbms ipc message' count=1 wait_time=2470581 timeout=fc, =0, =0 for 'rdbms ipc message' count=1 wait_time=1 timeout=fc, =0, =0 for 'rdbms ipc message' count=1 wait_time=468661 timeout=12c, =0, =0 for 'rdbms ipc message' count=1 wait_time=2933950 timeout=12c, =0, =0 for 'os thread startup' count=1 wait_time=41005 =0, =0, =0 另外看了几个其他进程的信息,发现都是在等待 rdbms ipc message。 记得当时还做了一个oradebug hanganalyze 3,发现lck0进程居然阻塞了451个objects。 由于当时的产生的trace我忘记取下来了, 所以就只能用这个trace中去寻找信息了。 ###### 搜索lck字符串:###### PROCESS 20: —————————————- SO: c0000000ad694f00, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00 (process) Oracle pid=20, calls cur/top: c0000000adabb810/c0000000adabb810, flag: (6) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 33 last post received-location: ksrpublish last process to post me: c0000000ad694f00 1 6 last post sent: 0 0 33 last post sent-location: ksrpublish last process posted by me: c0000000ad694f00 1 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: c0000000ad784f20 O/S info: user: oracle, term: UNKNOWN, ospid: 27684 OSD pid info: Unix process pid: 27684, image: oracle@uamdb2 (LCK0) ###### 然后搜索字符串c0000000ad694f00 ###### SO: c0000000adbb9fd0, type: 5, owner: c0000000ad694f00, flag: INIT/-/-/0x00 (enqueue) XR-00000000-00000000 DID: 0000-0014-00000003 lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x3 res: 0xc0000000adcbf510, mode: NULL, lock_flag: 0x0 own: 0xc0000000ada5b940, sess: 0xc0000000ada5b940, proc: 0xc0000000ad694f00, prv: 0xc0000000adcbf520 slk: 0xc0000000ab40db98 这段内容是enqueue state object信息,发现该进程持有一个enqueue。 从这里的 res_flag: 0x3 可以判断:这个enqueue是global类型的,而且正在被持有,还没被free。 继续往下搜索,发现了如下类似的大量信息: SO: c0000000add5e1e8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00, event: 3, last message event: 3, last message waited event: 3, messages read: 0 channel: (c0000000add81108) kea interrupt signal channel scope: 5, event: 3, last mesage event: 0, publishers/subscribers: 1/2, messages published: 0 —————————————- SO: c0000000add5e0d8, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00, event: 3, last message event: 3326313, last message waited event: 3326313, messages read: 3324562 channel: (c0000000add7ee18) kxfp control signal channel scope: 5, event: 3326313, last mesage event: 3326313, publishers/subscribers: 4/2, messages published: 3324562 这里的type值为11,代表的意思是:Broadcast Handle(即广播消息句柄)。难怪做oradebug的时候,发现这个lck进程 阻塞了451个objects,而且等待居然都是rdbms ipc message。 继续往下搜索,发现了如下关键信息: —————————————- SO: c0000000ada5b940, type: 4, owner: c0000000ad694f00, flag: INIT/-/-/0x00 (session) sid: 537 trans: 0000000000000000, creator: c0000000ad694f00, flag: (51) USR/- BSY/-/-/-/-/- DID: 0000-0014-00000003, short-term DID: 0000-0000-00000000 txn branch: 0000000000000000 oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS service name: SYS$BACKGROUND waiting for 'rdbms ipc message' blocking sess=0x0000000000000000 seq=43421 wait_time=0 seconds since wait started=1 timeout=12c, =0, =0 Dumping Session Wait History for 'ksxr poll remote instances' count=1 wait_time=7 =0, =0, =0 for 'ksxr poll remote instances' count=1 wait_time=7 =0, =0, =0 for 'rdbms ipc message' count=1 wait_time=1906819 timeout=c2, =0, =0 for 'ksxr poll remote instances' count=1 wait_time=9 =0, =0, =0 for 'rdbms ipc message' count=1 wait_time=12 timeout=c2, =0, =0 for 'rdbms ipc message' count=1 wait_time=1041870 timeout=12c, =0, =0 for 'ksxr poll remote instances' count=1 wait_time=7 =0, =0, =0 for 'ksxr poll remote instances' count=1 wait_time=10 =0, =0, =0 for 'rdbms ipc message' count=1 wait_time=807692 timeout=52, =0, =0 for 'rdbms ipc message' count=1 wait_time=27 timeout=52, =0, =0 temporary object counter: 0 —————————————- UOL used : 0 locks(used=0, free=0) KGX Atomic Operation Log c0000000aca2e6e8 Mutex 0000000000000000(0, 0) idn 0 oper NONE Cursor Pin uid 537 efd 3 whr 11 slp 0 KGX Atomic Operation Log c0000000aca2e730 Mutex 0000000000000000(0, 0) idn 0 oper NONE Library Cache uid 537 efd 0 whr 0 slp 0 KGX Atomic Operation Log c0000000aca2e778 Mutex 0000000000000000(0, 0) idn 0 oper NONE Library Cache uid 537 efd 0 whr 0 slp 0 —————————————- SO: c0000000a1a7fe78, type: 41, owner: c0000000ada5b940, flag: INIT/-/-/0x00 (dummy) nxc=0, nlb=0 —————————————- SO: c0000000add5ab28, type: 11, owner: c0000000ad694f00, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad694f00, event: 20, last message event: 26, last message waited event: 26, messages read: 1 channel: (c0000000add7a2d8) system events broadcast channel scope: 2, event: 204545, last mesage event: 26, publishers/subscribers: 0/75, messages published: 1 —————————————- SO: c0000000ab2ba708, type: 19, owner: c0000000ad694f00, flag: INIT/-/-/0x00 GES MSG BUFFERS: st=emp chunk=0x0000000000000000 hdr=0x0000000000000000 lnk=0x0000000000000000 flags=0x0 inc=36 outq=0 sndq=0 opid=20 prmb=0x0 mbg[i]=(3277535 648561) mbg[b]=(0 0) mbg[r]=(0 0) fmq[i]=(4 0) fmq[b]=(0 0) fmq[r]=(0 0) mop[s]=648561 mop[q]=3277535 pendq=0 zmbq=0 nonksxp_recvs=0 ————process 0xc0000000ab2ba708——————– proc version : 0 Local node : 1 pid : 27684 lkp_node : 1 svr_mode : 0 proc state : KJP_NORMAL Last drm hb acked : 0 Total accesses : 221605743 Imm. accesses : 196696635 Locks on ASTQ : 0 Locks Pending AST : 113 Granted locks : 56081 AST_Q: PENDING_Q: lp 0xc0000000acb505c0 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV] master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x9565a convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK lp 0xc00000007a220300 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV] master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK lp 0xc000000074ac4c38 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV] master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a convert opt KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK lp 0xc0000000748ca940 gl KJUSERNL rl KJUSERPR rp 0xc000000094eeda88 [0x2345][0xffffffff],[IV] master 1 pid 27684 bast 0 rseq 1 mseq 0 history 0x565a565a 这里的rp 0xc000000094eeda88 我猜应该是enqueue等待的资源名,于是搜索resouce 0xc000000094eeda88字符串,发现了如下信息: ———-enqueue 0xc00000007a3cb2e0———————— lock version : 1 Owner node : 1 grant_level : KJUSERNL req_level : KJUSERPR bast_level : KJUSERNL notify_func : 0x0000000000000000 resp : 0xc000000094eeda88 procp : 0xc0000000ab2ba708 pid : 27684 proc version : 0 oprocp : 0x0000000000000000 opid : 0 group lock owner : 0x0000000000000000 xid : 0000-0000-00000000 dd_time : 0.0 secs dd_count : 0 timeout : 0.0 secs On_timer_q? : N On_dd_q? : N lock_state : CONVERTING Open Options : KJUSERPROCESS_OWNED KJUSERCLIENTLOCK Convert options : KJUSERGETVALUE KJUSERNODEADLOCKWAIT KJUSERNODEADLOCKBLOCK History : 0x565a565a Msg_Seq : 0x0 res_seq : 1 valblk : 0x000000020001fcc00000000000000000 . ———-resource 0xc000000094eeda88———————- resname : [0x2345][0xffffffff],[IV] Local node : 1 dir_node : 1 master_node : 1 hv idx : 35 hv last r.inc : 6 current inc : 36 hv status : 0 hv master : 1 open options : grant_bits : KJUSERNL KJUSERPR grant mode : KJUSERNL KJUSERCR KJUSERCW KJUSERPR KJUSERPW KJUSEREX count : 114 0 0 333 0 0 val_state : KJUSERVS_VALUE valblk : 0x000000020001fcc00000000000000000 . access_node : 1 vbreq_state : 0 state : x8 resp : 0xc000000094eeda88 On Scan_q? : N Total accesses: 4524902 Imm. accesses: 1948489 Granted_locks : 333 Cvting_locks : 114 总的来说,给人的感觉似乎lck是罪魁祸首,其实不竟然。 用awk脚本格式化了这个systemstate dump后,得到下面的结果: System State 1 ~~~~~~~~~~~~~~~~ 1: 2: waiting for 'pmon timer' wait 3: waiting for 'DIAG idle wait' wait 4: waiting for 'rdbms ipc message' wait 5: waiting for 'rdbms ipc message' wait 6: waiting for 'ges remote message' wait 7: waiting for 'gcs remote message' wait 8: waiting for 'gcs remote message' wait 9: waiting for 'rdbms ipc message' wait 10: waiting for 'rdbms ipc message' wait 11: waiting for 'rdbms ipc message' wait 12: waiting for 'rdbms ipc message' wait 13: waiting for 'smon timer' wait 14: waiting for 'rdbms ipc message' wait 15: waiting for 'rdbms ipc message' wait 16: waiting for 'rdbms ipc message' wait 17: waiting for 'rdbms ipc message' wait 18: 19: 20: waiting for 'rdbms ipc message' wait 21: waiting for 'SQL*Net message from client' wait 22: waiting for 'rdbms ipc message' wait 23: waiting for 'rdbms ipc message' wait 24: waiting for 'SQL*Net message from client' wait 25: waiting for 'Streams AQ: qmn coordinator idle wait' wait 26: waiting for 'SQL*Net message from client' wait 27: waiting for 'SQL*Net message from client' wait Cmd: PL/SQL Execute 28: for 'Streams AQ: waiting for time management or cleanup tasks' wait 29: 30: waiting for 'SQL*Net message from client' wait 31: waiting for 'SQL*Net message from client' wait 32: waiting for 'SQL*Net message from client' wait 33: waiting for 'SQL*Net message from client' wait 34: waiting for 'SQL*Net message from client' wait 35: waiting for 'SQL*Net message from client' wait 36: waiting for 'SQL*Net message from client' wait 37: waiting for 'SQL*Net message from client' wait 38: waiting for 'SQL*Net message from client' wait 39: waiting for 'SQL*Net message from client' wait Cmd: Insert 40: waiting for 'SQL*Net message from client' wait 41: waiting for 'SQL*Net message from client' wait 42: last wait for 'SQL*Net message from client' 43: waiting for 'SQL*Net message from client' wait 45: waiting for 'Streams AQ: qmn slave idle wait' wait 46: waiting for 'SQL*Net message from client' wait Cmd: Insert 47: waiting for 'SQL*Net message from client' wait 48: waiting for 'SQL*Net message from client' wait 49: waiting for 'SQL*Net message from client' wait 50: waiting for 'SQL*Net message from client' wait 52: waiting for 'SQL*Net message from client' wait 53: waiting for 'SQL*Net message from client' wait 54: waiting for 'SQL*Net message from client' wait 55: waiting for 'SQL*Net message from client' wait 56: waiting for 'SQL*Net message from client' wait 58: waiting for 'SQL*Net message from client' wait 59: waiting for 'SQL*Net message from client' wait 60: waiting for 'SQL*Net message from client' wait 61: waiting for 'SQL*Net message from client' wait 65: waiting for 'SQL*Net message from client' wait 66: waiting for 'SQL*Net message from client' wait 90: waiting for 'SQL*Net message from client' wait 91: waiting for 'SQL*Net message from client' wait 92: waiting for 'SQL*Net message from client' wait 93: waiting for 'SQL*Net message from client' wait 94: waiting for 'SQL*Net message from client' wait 95: waiting for 'SQL*Net message from client' wait 96: waiting for 'SQL*Net message from client' wait 106:waiting for 'SQL*Net message from client' wait 107:waiting for 'SQL*Net message from client' wait 108:waiting for 'SQL*Net message from client' wait 109:waiting for 'SQL*Net message from client' wait 110:waiting for 'SQL*Net message from client' wait 111:waiting for 'SQL*Net message from client' wait 112:waiting for 'SQL*Net message from client' wait 113:waiting for 'SQL*Net message from client' wait 114:waiting for 'SQL*Net message from client' wait 115:waiting for 'SQL*Net message from client' wait 116:waiting for 'SQL*Net message from client' wait 117:waiting for 'SQL*Net message from client' wait 120:waiting for 'SQL*Net message from client' wait NO BLOCKING PROCESSES FOUND 881983 Lines Processed. 从分析结果来看,断定pmon进程才是真正的罪魁祸首,搜索pmon关键字: PROCESS 2: —————————————- SO: c0000000ad68c020, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00 (process) Oracle pid=2, calls cur/top: c0000000adab8098/c0000000adab8098, flag: (e) SYSTEM int error: 0, call error: 0, sess error: 0, txn error 0 (post info) last post received: 0 0 48 last post received-location: ksoreq_reply last process to post me: c0000000ad692750 4 2 last post sent: 0 0 24 last post sent-location: ksasnd last process posted by me: c0000000ad68d000 1 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: c0000000ad784f20 O/S info: user: oracle, term: UNKNOWN, ospid: 27626 OSD pid info: Unix process pid: 27626, image: oracle@uamdb2 (PMON) ###### 省略部分信息 ###### —————————————- SO: c0000000add586a8, type: 11, owner: c0000000ad68c020, flag: INIT/-/-/0x00 (broadcast handle) flag: (2) ACTIVE SUBSCRIBER, owner: c0000000ad68c020, event: 1, last message event: 1, last message waited event: 1, messages read: 0 channel: (c0000000add7e4b0) scumnt mount lock scope: 1, event: 15, last mesage event: 0, publishers/subscribers: 0/14, messages published: 0 —————————————- SO: c0000000ada73a90, type: 4, owner: c0000000ad68c020, flag: INIT/-/-/0x00 (session) sid: 555 trans: 0000000000000000, creator: c0000000ad68c020, flag: (51) USR/- BSY/-/-/-/-/- DID: 0002-0002-00000004, short-term DID: 0000-0000-00000000 txn branch: 0000000000000000 oct: 0, prv: 0, sql: 0000000000000000, psql: 0000000000000000, user: 0/SYS service name: SYS$BACKGROUND waiting for 'pmon timer' blocking sess=0x0000000000000000 seq=50 wait_time=0 seconds since wait started=75789 duration=12c, =0, =0 Dumping Session Wait History for 'pmon timer' count=1 wait_time=2929152 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929462 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929351 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929175 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2928083 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929290 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929509 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929262 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929283 duration=12c, =0, =0 for 'pmon timer' count=1 wait_time=2929525 duration=12c, =0, =0 temporary object counter: 0 |
2929152/3600/24=33天左右。换句话说,pmon进程在1个月前就出现异常了,同事也说5月26号的时候重启过,
后来似乎就有问题了,掐指一算,时间似乎刚刚对上。
最后我在中午申请了一点停机时间,我们进行了处理,首先我kill了diag进程,12g的磁盘空间得到了释放,
最开始我断定是lck进程导致的,于是kill了该进程,结果直接导致该节点重启了,不过还好的是,23s就重启完成了。
CASE 4.
这个是昨天的一天case,某个客户方工程师的一套hp 9204系统(HA)环境,记得是我以前部署的。
客户工程师给listener设置密码以后,用脚本来启动监听,发现报错,非常的奇怪,信息如下:
TNS-01169: The listener has not recognized the password
当时他的脚本如下:
1 2 3 4 5 6 7 8 9 10 |
su - oracle<<EOF export ORACLE_SID=roger lsnrctl set current_listener listener1 set password roger status EOF su - oracle<<EOF ps -ef | grep ora EOF |
但看脚本,似乎没啥问题,最后我才发现是有问题,那也是折腾1个小时之后才发现的,几个同事一起看,都没发现问题的所在。
老实说,我也偶然发现的,记得快下班的时候,想起了老白的一句话: 像oracle一样思考
想到这句话,突然另我茅塞顿开,我当时是这样认为的:
==> 既然我们要对listener设置密码,那么必然是为了安全性;
==> 通过set password roger 加到脚本中来启动,要是有读权限的用户不就能看到密码了吗?那还谈何安全性呢?
很明显,这里就自相矛盾了;
于是我尝试将脚本修改为如下:
1 2 3 4 5 6 7 8 |
su - oracle<<EOF export ORACLE_SID=roger lsnrctl set current_listener listener1 set password roger -- roger修改为密文(即加密后listener.ora中的值) status exit EOF |
再次执行,发现居然ok了,最后我又想,既然是要输入加密后的密码而不是明文,那么metalink文档应该有明确的说明啊;
记得在开始遇到这个问题之后,我也看了metalink文档
How To Start/Stop A Password Protected Listener From A Script [ID 361919.1]
但是只是扫了一眼,没有引起重视,最后解决问题以后,我再次看了下,原来如此:
set password encrypted_password — 其实文档是明确说了,要输入加密后的密码
这当然说明了我当时太马虎了,不过也确实容易忽略。
补充一下,如果在10g中进行类似的测试,本机操作需要设置参数LOCAL_OS_AUTHENTICATION_
5 Responses to “最近一周遇到的几个case”
roger:
我给listener设置了密码,
PASSWORDS_LISTENER = 6D7AA003392C436A(明文为:123456)
然后使用
lsnrctl<>listener.log
set current_listener listener
set password 123456
status
EOF
测试成功
修改为:
lsnrctl<>listener.log
set current_listener listener
set password 6D7AA003392C436A
status
EOF
提示:
TNS-01169: The listener has not recognized the password
和你说的结果相反
oracle 版本?
Have you seen this new plugin from online marketing genius Daniel Tan known as SEOPressor ? I have already been implementing it for the past couple weeks and I now have five of my new blogs on the first page of google for highly competetive terms! I would suggest you try it out here: SEOPressor
Impressive blog, the only thing I noticed is your on-page SEO might use just a bit of work and then you would definitely be achieving a high position in google in no time! Have you used the SEOPressor plugin? It’s extremely effective and user friendly. Download it here. Just thought I’d share. Good luck to you.
why dont you add a mailing list to your blog?
Leave a Reply
You must be logged in to post a comment.