gc freelist问题再现
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: gc freelist问题再现
今天一大早某客户在vx群反馈其中一套系统早上上班时间出现了一段时间的hung,后面自动恢复了。从客户提供的ASH数据来看,问题相对清晰:
Slot Time (Duration) | Slot Count | Event | Event Count | % Event |
---|---|---|---|---|
09:26:01 (2.0 min) | 56,948 | gc freelist | 42,070 | 9.28 |
09:26:01 (2.0 min)
|
56948
|
read by other session | 9,985 | 2.20 |
09:26:01 (2.0 min)
|
56948
|
enq: HW – contention | 3,332 | 0.73 |
09:28:00 (2.0 min) | 59,045 | gc freelist | 43,374 | 9.56 |
09:28:00 (2.0 min)
|
59045
|
read by other session | 10,188 | 2.25 |
09:28:00 (2.0 min)
|
59045
|
enq: HW – contention | 3,773 | 0.83 |
09:30:00 (2.0 min) | 59,432 | gc freelist | 43,156 | 9.52 |
09:30:00 (2.0 min)
|
59432
|
read by other session | 9,253 | 2.04 |
09:30:00 (2.0 min)
|
59432
|
enq: HW – contention | 5,238 | 1.16 |
09:32:00 (2.0 min) | 59,812 | gc freelist | 43,037 | 9.49 |
09:32:00 (2.0 min)
|
59812
|
read by other session | 9,296 | 2.05 |
09:32:00 (2.0 min)
|
59812
|
enq: HW – contention | 5,449 | 1.20 |
09:34:00 (2.0 min) | 60,775 | gc freelist | 43,675 | 9.63 |
09:34:00 (2.0 min)
|
60775
|
read by other session | 9,619 | 2.12 |
09:34:00 (2.0 min)
|
60775
|
enq: HW – contention | 5,428 | 1.20 |
09:36:00 (2.0 min) | 63,090 | gc freelist | 44,403 | 9.79 |
09:36:00 (2.0 min)
|
63090
|
read by other session | 11,003 | 2.43 |
09:36:00 (2.0 min)
|
63090
|
enq: HW – contention | 5,520 | 1.22 |
09:38:00 (2.0 min) | 63,576 | gc freelist | 44,636 | 9.84 |
09:38:00 (2.0 min)
|
63576
|
read by other session | 11,286 | 2.49 |
09:38:00 (2.0 min)
|
63576
|
enq: HW – contention | 5,474 | 1.21 |
09:40:00 (1.0 min) | 30,798 | gc freelist | 20,947 | 4.62 |
09:40:00 (1.0 min)
|
30798
|
read by other session | 5,530 | 1.22 |
09:40:00 (1.0 min)
|
30798
|
enq: HW – contention | 2,688 | 0.59 |
之前写过一篇关于gc freelist等待的文章。实际上该等待事件并不多见,如果当你发现top event中出现该等待,那么多半是遇到了Bug。
我们回到这个案例本身,实际上从ASH数据来看,无非就是gc freelist,enq:HW – contention,read by other session. 从直观感觉我认为可能大概率出问题在gc freelist,从而导致整个集群都非常缓慢,近似于hung的状态。当然,当其中一个节点都出现了大量gc freelist,导致整个系统很慢的情况下,那么另外一个节点ash 势必也是大量的锁等待。
Slot Time (Duration) | Slot Count | Event | Event Count | % Event |
---|---|---|---|---|
09:22:36 (1.4 min) | 6,506 | enq: HW – contention | 2,859 | 3.02 |
09:22:36 (1.4 min)
|
6506
|
enq: TX – row lock contention | 1,380 | 1.46 |
09:22:36 (1.4 min)
|
6506
|
gc buffer busy acquire | 1,128 | 1.19 |
09:24:00 (2.0 min) | 10,361 | enq: HW – contention | 4,266 | 4.51 |
09:24:00 (2.0 min)
|
10361
|
enq: TX – index contention | 2,679 | 2.83 |
09:24:00 (2.0 min)
|
10361
|
enq: TX – row lock contention | 2,452 | 2.59 |
09:26:00 (2.0 min) | 11,352 | enq: HW – contention | 4,550 | 4.81 |
09:26:00 (2.0 min)
|
11352
|
enq: TX – index contention | 3,030 | 3.20 |
09:26:00 (2.0 min)
|
11352
|
enq: TX – row lock contention | 2,802 | 2.96 |
09:28:00 (2.0 min) | 12,415 | enq: HW – contention | 5,000 | 5.29 |
09:28:00 (2.0 min)
|
12415
|
enq: TX – index contention | 3,411 | 3.61 |
09:28:00 (2.0 min)
|
12415
|
enq: TX – row lock contention | 3,078 | 3.25 |
09:30:00 (2.0 min) | 13,300 | enq: HW – contention | 5,271 | 5.57 |
09:30:00 (2.0 min)
|
13300
|
enq: TX – index contention | 3,184 | 3.37 |
09:30:00 (2.0 min)
|
13300
|
enq: TX – row lock contention | 3,079 | 3.25 |
09:32:00 (2.0 min) | 13,638 | enq: HW – contention | 5,423 | 5.73 |
09:32:00 (2.0 min)
|
13638
|
enq: TX – row lock contention | 3,296 | 3.48 |
09:32:00 (2.0 min)
|
13638
|
gc buffer busy acquire | 3,216 | 3.40 |
09:34:00 (2.0 min) | 14,283 | enq: HW – contention | 5,647 | 5.97 |
09:34:00 (2.0 min)
|
14283
|
enq: TX – row lock contention | 3,678 | 3.89 |
09:34:00 (2.0 min)
|
14283
|
gc buffer busy acquire | 3,307 | 3.50 |
09:36:00 (1.7 min) | 12,751 | enq: HW – contention | 5,083 | 5.37 |
09:36:00 (1.7 min)
|
12751
|
enq: TX – row lock contention | 3,327 | 3.52 |
09:36:00 (1.7 min)
|
12751
|
gc buffer busy acquire | 3,108 | 3.29 |
因此这个时候看ash已经没有什么用了。从目前判断问题主要在节点2上,因此我来看下客户提供的节点2故障时间段内的一个diag trace:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 |
SO: 0x3b02491fd8, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3b02491fd8, name=process, file=ksu.h LINE:12721, pg=0 (process) Oracle pid:903, ser:142, calls cur/top: 0x27a2ded4f8/0x27a874fa70 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 184 0 2 last post received-location: ksl2.h LINE:2374 ID:kslpsr last process to post me: 0x39c2424aa0 1 6 last post sent: 0 0 183 last post sent-location: kcl2.h LINE:4436 ID:kclnfndnewm last process posted by me: 0x3b023aa048 1 6 (latch info) wait_event=184 bits=0x0 Process Group: DEFAULT, pseudo proc: 0x39c262fe68 O/S info: user: grid, term: UNKNOWN, ospid: 52182 OSD pid info: Unix process pid: 52182, image: oracle@xxxxxdb2 ---------------------------------------- SO: 0x3a0ca21218, type: 10, owner: 0x3b02491fd8, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x3b02491fd8, name=FileOpenBlock, file=ksfd.h LINE:6688, pg=0 (FOB) 0x3a0ca21218 flags=2560 fib=0x39cc6c67a8 incno=8 pending i/o cnt=0 fname=+DATA/ythdb/datafile/system.256.1048801851 fno=1 lblksz=8192 fsiz=651520 ---------------------------------------- SO: 0x39e29be138, type: 4, owner: 0x3b02491fd8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x3b02491fd8, name=session, file=ksu.h LINE:12729, pg=0 (session) sid: 1254 ser: 26125 trans: 0x387e16b0b0, creator: 0x3b02491fd8 flags: (0x100041) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40009) -/-/INC DID: , short-term DID: txn branch: (nil) edition#: 0 oct: 2, prv: 0, sql: 0x3a9dc80b10, psql: 0x3a9dc80b10, user: 0/SYS ksuxds FALSE at location: 0 service name: ythdb client details: O/S info: user: root, term: unknown, ospid: 1234 machine: gfmis-pay-sendglsaccount-85b456d76c-zzctl program: JDBC Thin Client application name: JDBC Thin Client, hash value=2546894660 Current Wait Stack: 0: waiting for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4839 seq_num=4840 snap_id=1 wait times: snap=0.001568 sec, exc=0.001568 sec, total=0.001568 sec wait times: max=0.200000 sec, heur=16 min 8 sec wait counts: calls=1 os=1 in_wait=1 iflags=0x5a8 There are 64 sessions blocked by this session. Dumping one waiter: inst: 2, sid: 1289, ser: 25931 wait event: 'enq: HW - contention' p1: 'name|mode'=0x48570006 p2: 'table space #'=0x0 p3: 'block'=0x400978 row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 min_blocked_time: 943 secs, waiter_cache_ver: 38771 Wait State: fixed_waits=0 flags=0x22 boundary=(nil)/-1 Session Wait History: elapsed time of 0.000100 sec since current wait 0: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4838 seq_num=4839 snap_id=1 wait times: snap=0.017854 sec, exc=0.017854 sec, total=0.017854 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000090 sec of elapsed time 1: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4837 seq_num=4838 snap_id=1 wait times: snap=0.200893 sec, exc=0.200893 sec, total=0.200893 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000105 sec of elapsed time 2: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4836 seq_num=4837 snap_id=1 wait times: snap=0.200923 sec, exc=0.200923 sec, total=0.200923 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000078 sec of elapsed time 3: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4835 seq_num=4836 snap_id=1 wait times: snap=0.200908 sec, exc=0.200908 sec, total=0.200908 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000092 sec of elapsed time 4: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4834 seq_num=4835 snap_id=1 wait times: snap=0.200906 sec, exc=0.200906 sec, total=0.200906 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000094 sec of elapsed time 5: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4833 seq_num=4834 snap_id=1 wait times: snap=0.200900 sec, exc=0.200900 sec, total=0.200900 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000092 sec of elapsed time 6: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4832 seq_num=4833 snap_id=1 wait times: snap=0.200918 sec, exc=0.200918 sec, total=0.200918 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000088 sec of elapsed time 7: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4831 seq_num=4832 snap_id=1 wait times: snap=0.200907 sec, exc=0.200907 sec, total=0.200907 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000094 sec of elapsed time 8: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4830 seq_num=4831 snap_id=1 wait times: snap=0.200883 sec, exc=0.200883 sec, total=0.200883 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000113 sec of elapsed time 9: waited for 'gc freelist' =0x0, =0x0, =0x0 wait_id=4829 seq_num=4830 snap_id=1 wait times: snap=0.200897 sec, exc=0.200897 sec, total=0.200897 sec wait times: max=0.200000 sec wait counts: calls=1 os=1 occurred after 0.000103 sec of elapsed time |
我们可以清楚的看到,实际上大量session被该会话所阻塞,而该session在等待gc freelist,而且已经等待了10多分钟。
进一步分析该session的trace,可以看到执行情况,发现跟审计有关:
1 2 3 4 5 6 7 |
User=0x39e29be138 Session=0x39e29be138 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=637acf85 LibraryHandle: Address=0x3a9dc80b10 Hash=367d06a6 LockMode=N PinMode=0 LoadLockMode=0 Status=VALD ObjectName: Name=insert into sys.aud$( sessionid,entryid,statement,ntimestamp#, userid,userhost,terminal,action#,returncode, obj$creator,obj$name,auth$privileges,auth$grantee, new$owner,new$name,ses$actions,ses$tid,logoff$pread, logoff$lwrite,logoff$dead,comment$text,spare1,spare2, priv$used,clientid,sessioncpu,proxy$sid,user$guid, instance#,process#,xid,scn,auditid, sqlbind,sqltext,obj$edition,dbid) values(:1,:2,:3,SYS_EXTRACT_UTC(SYSTIMESTAMP), :4,:5,:6,:7,:8, :9,:10,:11,:12, :13,:14,:15,:16,:17, :18,:1 FullHashValue=4ba8a6dafb69204e4de1216b367d06a6 Namespace=SQL AREA(00) Type=CURSOR(00) Identifier=914163366 OwnerIdn=0 Statistics: InvalidationCount=0 ExecutionCount=299986 LoadCount=20 ActiveLocks=869 TotalLockCount=299437 TotalPinCount=2 Counters: BrokenCount=1 RevocablePointer=1 KeepDependency=2 Version=0 BucketInUse=37261 HandleInUse=37261 HandleReferenceCount=0 |
从现象来看,跟Oracle mos文章 INSERT INTO SYS.AUD$ Is Blocking Due to Waiting on ‘gc freelist’ (Doc ID 2207798.1) 描述很像的。
同时也参考了几篇类似文章例如Unified And/Or Traditional Auditing Inserts Waiting On GC FREELIST (In ASSM Tablespace) (Doc ID 2596033.1)
总的来讲,建议是调整如下参数:_gc_element_percent 建议设置为140~200之间(默认值是110).
当然,根据本文这里的case,实际上也可以通过move aud$到其他表空间来解决或关闭审计来绕过这个问题。
Leave a Reply
You must be logged in to post a comment.