Oracle instance hung due to gc buffer busy acquire/gc current request
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: Oracle instance hung due to gc buffer busy acquire/gc current request
一客户某系统反馈在上周深夜出现严重性能问题,整个系统出现了短暂hung,导致无法使用。通过用户提供的awr元数据,这里做了简单分析:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 |
SQL> select a.instance_number, 2 TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss'), 3 b.EVENT_NAME, 4 count(1) 5 from WRH$_ACTIVE_SESSION_HISTORY a,WRH$_EVENT_NAME b 6 where a.dbid=3866720943 7 and a.event_id=b.event_id 8 and TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') > '2022-11-18 23:23:00' 9 and TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') < '2022-11-18 23:33:00' 10 ---and b.WAIT_CLASS<> 'Idle' and b.event_name not like '%SQL*Net%' 11 12 and b.event_name not like 'gc%%' group by a.instance_number, TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss'), 13 14 15 b.EVENT_NAME 16 order by 1, 2; INSTANCE_NUMBER TO_CHAR(A.SAMPLE_TIME,'YYYY-MM-DDHH24: EVENT_NAME COUNT(1) --------------- -------------------------------------- ---------------------------------------- ---------- 1 2022-11-18 23:27:59 CGS wait for IPC msg 2 1 2022-11-18 23:27:59 buffer busy waits 58 1 2022-11-18 23:27:59 db file scattered read 2 1 2022-11-18 23:27:59 db file sequential read 2 1 2022-11-18 23:27:59 enq: SQ - contention 308 1 2022-11-18 23:27:59 enq: TX - index contention 150 1 2022-11-18 23:27:59 gc buffer busy acquire 400 1 2022-11-18 23:27:59 gc current block busy 4 1 2022-11-18 23:27:59 gc current request 378 1 2022-11-18 23:27:59 row cache lock 78 1 2022-11-18 23:28:09 buffer busy waits 58 1 2022-11-18 23:28:09 enq: SQ - contention 308 1 2022-11-18 23:28:09 enq: TX - index contention 150 1 2022-11-18 23:28:09 gc buffer busy acquire 402 1 2022-11-18 23:28:09 gc cr multi block request 2 1 2022-11-18 23:28:09 gc current block busy 4 1 2022-11-18 23:28:09 gc current request 378 1 2022-11-18 23:28:09 row cache lock 78 1 2022-11-18 23:28:19 buffer busy waits 58 1 2022-11-18 23:28:19 db file scattered read 2 1 2022-11-18 23:28:19 enq: SQ - contention 308 1 2022-11-18 23:28:19 enq: TX - index contention 150 1 2022-11-18 23:28:19 gc buffer busy acquire 404 1 2022-11-18 23:28:19 gc current block busy 6 1 2022-11-18 23:28:19 gc current request 378 1 2022-11-18 23:28:19 row cache lock 78 1 2022-11-18 23:28:30 buffer busy waits 58 1 2022-11-18 23:28:30 db file scattered read 2 1 2022-11-18 23:28:30 enq: SQ - contention 308 1 2022-11-18 23:28:30 enq: TX - index contention 150 1 2022-11-18 23:28:30 gc buffer busy acquire 416 1 2022-11-18 23:28:30 gc current block busy 16 1 2022-11-18 23:28:30 gc current request 378 1 2022-11-18 23:28:30 row cache lock 78 1 2022-11-18 23:28:40 buffer busy waits 58 1 2022-11-18 23:28:40 db file scattered read 2 1 2022-11-18 23:28:40 enq: SQ - contention 308 1 2022-11-18 23:28:40 enq: TX - index contention 150 1 2022-11-18 23:28:40 gc buffer busy acquire 416 1 2022-11-18 23:28:40 gc current request 394 1 2022-11-18 23:28:40 row cache lock 78 1 2022-11-18 23:28:50 buffer busy waits 58 1 2022-11-18 23:28:50 control file sequential read 2 1 2022-11-18 23:28:50 db file scattered read 2 1 2022-11-18 23:28:50 enq: SQ - contention 310 1 2022-11-18 23:28:50 enq: TX - index contention 150 1 2022-11-18 23:28:50 gc buffer busy acquire 414 1 2022-11-18 23:28:50 gc current block busy 16 1 2022-11-18 23:28:50 gc current request 378 1 2022-11-18 23:28:50 reliable message 2 1 2022-11-18 23:28:50 row cache lock 78 1 2022-11-18 23:29:01 buffer busy waits 58 1 2022-11-18 23:29:01 db file scattered read 2 1 2022-11-18 23:29:01 enq: SQ - contention 312 1 2022-11-18 23:29:01 enq: TX - index contention 150 1 2022-11-18 23:29:01 gc buffer busy acquire 416 1 2022-11-18 23:29:01 gc current block busy 18 1 2022-11-18 23:29:01 gc current request 378 1 2022-11-18 23:29:01 row cache lock 78 1 2022-11-18 23:29:12 SQL*Net message to client 2 1 2022-11-18 23:29:12 buffer busy waits 58 1 2022-11-18 23:29:12 db file scattered read 2 1 2022-11-18 23:29:12 enq: SQ - contention 314 1 2022-11-18 23:29:12 enq: TX - index contention 150 1 2022-11-18 23:29:12 gc buffer busy acquire 416 1 2022-11-18 23:29:12 gc current block busy 20 1 2022-11-18 23:29:12 gc current request 378 1 2022-11-18 23:29:12 row cache lock 78 1 2022-11-18 23:29:22 buffer busy waits 58 1 2022-11-18 23:29:22 db file scattered read 2 1 2022-11-18 23:29:22 enq: SQ - contention 314 1 2022-11-18 23:29:22 enq: TX - index contention 150 1 2022-11-18 23:29:22 gc buffer busy acquire 416 1 2022-11-18 23:29:22 gc current block 2-way 2 1 2022-11-18 23:29:22 gc current block busy 20 1 2022-11-18 23:29:22 gc current request 378 1 2022-11-18 23:29:22 row cache lock 78 1 2022-11-18 23:29:32 buffer busy waits 58 1 2022-11-18 23:29:32 db file scattered read 2 1 2022-11-18 23:29:32 enq: SQ - contention 316 1 2022-11-18 23:29:32 enq: TX - index contention 150 1 2022-11-18 23:29:32 gc buffer busy acquire 416 1 2022-11-18 23:29:32 gc current block busy 20 1 2022-11-18 23:29:32 gc current request 378 1 2022-11-18 23:29:32 row cache lock 78 1 2022-11-18 23:29:43 buffer busy waits 58 1 2022-11-18 23:29:43 db file scattered read 2 1 2022-11-18 23:29:43 enq: SQ - contention 318 1 2022-11-18 23:29:43 enq: TX - index contention 150 1 2022-11-18 23:29:43 gc buffer busy acquire 418 1 2022-11-18 23:29:43 gc cr multi block request 2 1 2022-11-18 23:29:43 gc current block busy 26 1 2022-11-18 23:29:43 gc current request 380 1 2022-11-18 23:29:43 row cache lock 78 1 2022-11-18 23:29:53 buffer busy waits 58 1 2022-11-18 23:29:53 control file sequential read 4 1 2022-11-18 23:29:53 db file scattered read 2 1 2022-11-18 23:29:53 enq: SQ - contention 318 1 2022-11-18 23:29:53 enq: TX - index contention 150 1 2022-11-18 23:29:53 gc buffer busy acquire 418 1 2022-11-18 23:29:53 gc current block busy 26 1 2022-11-18 23:29:53 gc current request 380 1 2022-11-18 23:29:53 row cache lock 78 1 2022-11-18 23:30:04 buffer busy waits 58 1 2022-11-18 23:30:04 db file scattered read 2 1 2022-11-18 23:30:04 enq: SQ - contention 318 1 2022-11-18 23:30:04 enq: TX - index contention 150 1 2022-11-18 23:30:04 gc buffer busy acquire 418 1 2022-11-18 23:30:04 gc current block busy 26 1 2022-11-18 23:30:04 gc current request 378 1 2022-11-18 23:30:04 row cache lock 78 1 2022-11-18 23:30:14 buffer busy waits 58 1 2022-11-18 23:30:14 db file scattered read 2 1 2022-11-18 23:30:14 enq: SQ - contention 318 1 2022-11-18 23:30:14 enq: TX - index contention 150 1 2022-11-18 23:30:14 gc buffer busy acquire 418 1 2022-11-18 23:30:14 gc current block busy 26 1 2022-11-18 23:30:14 gc current request 378 1 2022-11-18 23:30:14 gc current retry 2 1 2022-11-18 23:30:14 row cache lock 78 1 2022-11-18 23:30:24 buffer busy waits 58 1 2022-11-18 23:30:24 db file scattered read 2 1 2022-11-18 23:30:24 enq: SQ - contention 318 1 2022-11-18 23:30:24 enq: TX - index contention 150 1 2022-11-18 23:30:24 gc buffer busy acquire 418 1 2022-11-18 23:30:24 gc current block busy 26 1 2022-11-18 23:30:24 gc current request 378 1 2022-11-18 23:30:24 gc current retry 2 1 2022-11-18 23:30:24 row cache lock 78 1 2022-11-18 23:30:35 buffer busy waits 58 1 2022-11-18 23:30:35 enq: SQ - contention 318 1 2022-11-18 23:30:35 enq: TX - index contention 150 1 2022-11-18 23:30:35 gc buffer busy acquire 418 1 2022-11-18 23:30:35 gc cr multi block request 2 1 2022-11-18 23:30:35 gc cr request 2 1 2022-11-18 23:30:35 gc current block busy 26 1 2022-11-18 23:30:35 gc current request 380 1 2022-11-18 23:30:35 row cache lock 80 1 2022-11-18 23:30:45 buffer busy waits 58 1 2022-11-18 23:30:45 db file scattered read 2 1 2022-11-18 23:30:45 enq: SQ - contention 318 1 2022-11-18 23:30:45 enq: TX - index contention 150 1 2022-11-18 23:30:45 gc buffer busy acquire 418 1 2022-11-18 23:30:45 gc cr block 2-way 2 1 2022-11-18 23:30:45 gc current block 2-way 2 1 2022-11-18 23:30:45 gc current block busy 26 1 2022-11-18 23:30:45 gc current request 380 1 2022-11-18 23:30:45 row cache lock 80 1 2022-11-18 23:30:55 buffer busy waits 58 1 2022-11-18 23:30:55 db file scattered read 2 1 2022-11-18 23:30:55 db file sequential read 2 1 2022-11-18 23:30:55 enq: SQ - contention 318 1 2022-11-18 23:30:55 enq: TX - index contention 150 1 2022-11-18 23:30:55 gc buffer busy acquire 418 1 2022-11-18 23:30:55 gc current block busy 26 1 2022-11-18 23:30:55 gc current request 380 1 2022-11-18 23:30:55 row cache lock 80 1 2022-11-18 23:31:06 buffer busy waits 58 1 2022-11-18 23:31:06 db file parallel write 2 1 2022-11-18 23:31:06 db file scattered read 2 1 2022-11-18 23:31:06 enq: SQ - contention 318 1 2022-11-18 23:31:06 enq: TX - index contention 150 1 2022-11-18 23:31:06 gc buffer busy acquire 418 1 2022-11-18 23:31:06 gc current block busy 26 1 2022-11-18 23:31:06 gc current request 378 1 2022-11-18 23:31:06 gc current retry 2 1 2022-11-18 23:31:06 row cache lock 80 1 2022-11-18 23:31:17 buffer busy waits 58 1 2022-11-18 23:31:17 enq: SQ - contention 318 1 2022-11-18 23:31:17 enq: TX - index contention 150 1 2022-11-18 23:31:17 gc buffer busy acquire 418 1 2022-11-18 23:31:17 gc cr multi block request 2 1 2022-11-18 23:31:17 gc current block busy 26 1 2022-11-18 23:31:17 gc current request 378 1 2022-11-18 23:31:17 gc current retry 2 1 2022-11-18 23:31:17 row cache lock 80 1 2022-11-18 23:31:27 buffer busy waits 58 1 2022-11-18 23:31:27 db file scattered read 2 1 2022-11-18 23:31:27 enq: SQ - contention 318 1 2022-11-18 23:31:27 enq: TX - index contention 150 1 2022-11-18 23:31:27 gc buffer busy acquire 418 1 2022-11-18 23:31:27 gc current block busy 26 1 2022-11-18 23:31:27 gc current request 378 1 2022-11-18 23:31:27 row cache lock 80 2 2022-11-18 23:30:25 buffer busy waits 346 2 2022-11-18 23:30:25 control file sequential read 2 2 2022-11-18 23:30:25 db file parallel write 4 2 2022-11-18 23:30:25 enq: FB - contention 2 2 2022-11-18 23:30:25 enq: SQ - contention 186 2 2022-11-18 23:30:25 enq: TX - index contention 2 2 2022-11-18 23:30:25 gc buffer busy release 32 2 2022-11-18 23:30:25 log file switch (checkpoint incomplete) 3560 2 2022-11-18 23:30:25 row cache lock 14 2 2022-11-18 23:30:25 write complete waits 46 2 2022-11-18 23:30:35 SQL*Net more data to client 2 2 2022-11-18 23:30:35 buffer busy waits 346 2 2022-11-18 23:30:35 control file sequential read 2 2 2022-11-18 23:30:35 db file parallel write 4 2 2022-11-18 23:30:35 enq: FB - contention 2 2 2022-11-18 23:30:35 enq: SQ - contention 186 2 2022-11-18 23:30:35 enq: TX - index contention 2 2 2022-11-18 23:30:35 gc buffer busy release 34 2 2022-11-18 23:30:35 log file switch (checkpoint incomplete) 3562 2 2022-11-18 23:30:35 row cache lock 14 2 2022-11-18 23:30:35 write complete waits 46 2 2022-11-18 23:30:46 buffer busy waits 346 2 2022-11-18 23:30:46 db file parallel write 4 2 2022-11-18 23:30:46 enq: FB - contention 2 2 2022-11-18 23:30:46 enq: SQ - contention 186 2 2022-11-18 23:30:46 enq: TX - index contention 2 2 2022-11-18 23:30:46 gc buffer busy release 34 2 2022-11-18 23:30:46 log file switch (checkpoint incomplete) 3572 2 2022-11-18 23:30:46 row cache lock 14 2 2022-11-18 23:30:46 write complete waits 46 2 2022-11-18 23:30:57 CSS operation: action 2 2 2022-11-18 23:30:57 buffer busy waits 346 2 2022-11-18 23:30:57 control file sequential read 2 2 2022-11-18 23:30:57 db file parallel write 4 2 2022-11-18 23:30:57 enq: FB - contention 2 2 2022-11-18 23:30:57 enq: SQ - contention 186 2 2022-11-18 23:30:57 enq: TX - index contention 2 2 2022-11-18 23:30:57 gc buffer busy release 32 2 2022-11-18 23:30:57 latch: redo writing 4 2 2022-11-18 23:30:57 log file switch (checkpoint incomplete) 3578 2 2022-11-18 23:30:57 row cache lock 14 2 2022-11-18 23:30:57 write complete waits 50 2 2022-11-18 23:31:07 buffer busy waits 346 2 2022-11-18 23:31:07 db file parallel write 2 2 2022-11-18 23:31:07 enq: FB - contention 2 2 2022-11-18 23:31:07 enq: SQ - contention 188 2 2022-11-18 23:31:07 enq: TX - index contention 2 2 2022-11-18 23:31:07 gc buffer busy release 32 2 2022-11-18 23:31:07 log file switch (checkpoint incomplete) 3584 2 2022-11-18 23:31:07 row cache lock 14 2 2022-11-18 23:31:07 write complete waits 50 2 2022-11-18 23:31:18 buffer busy waits 346 2 2022-11-18 23:31:18 control file sequential read 2 2 2022-11-18 23:31:18 db file parallel write 4 2 2022-11-18 23:31:18 enq: FB - contention 2 2 2022-11-18 23:31:18 enq: SQ - contention 192 2 2022-11-18 23:31:18 enq: TX - index contention 2 2 2022-11-18 23:31:18 gc buffer busy release 32 2 2022-11-18 23:31:18 log file switch (checkpoint incomplete) 3586 2 2022-11-18 23:31:18 row cache lock 14 2 2022-11-18 23:31:18 write complete waits 50 2 2022-11-18 23:31:29 buffer busy waits 346 2 2022-11-18 23:31:29 db file parallel write 4 2 2022-11-18 23:31:29 enq: FB - contention 2 2 2022-11-18 23:31:29 enq: SQ - contention 192 2 2022-11-18 23:31:29 enq: TX - index contention 2 2 2022-11-18 23:31:29 gc buffer busy release 32 2 2022-11-18 23:31:29 log file switch (checkpoint incomplete) 3586 2 2022-11-18 23:31:29 row cache lock 14 2 2022-11-18 23:31:29 write complete waits 50 254 rows selected. |
从上面的查询接过来有如下几个疑点:
1、节点1实例从23:27:59开始突然出现大量gc等待,以及enq: SQ – contention,包括buffer busy waits等。
2、23:23 — 23:27:59之间,节点1 无任何相关等待记录
3、用户反馈的日志切换无法完成即log file switch(checkpoint incomplete)在节点2的上是突然出现的,时间点是23:30:25
4、节点2在23:30之前到23:23分之间这7分钟,没有任何性能数据记录。
从上述等待的情况和规律,我们可以大致得到如下几个基本结论:
log file switch等待属于叠加现象,并非根本原因,因为在此之前系统已经运行极为缓慢了,其次节点1在27分突然出现大量gc等待,可排除SQL本身性能影响,如执行计划突然改变导致性能变差,进而产生大量gc,引发该问题。尽管如此,这里我仍然查了一下gc top SQL:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 |
SQL> SQL> select a.instance_number, 2 TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss'), b.event_name, a.sql_id, count(1) from WRH$_ACTIVE_SESSION_HISTORY a,WRH$_EVENT_NAME b where a.dbid=3866720943 and a.event_id=b.event_id 3 4 5 6 7 8 9 and TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') = '2022-11-18 23:27:59' 10 ---and TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss') < '2022-11-18 23:31:30' 11 and b.WAIT_CLASS<> 'Idle' 12 and b.event_name like 'gc%' 13 ---and a.INSTANCE_NUMBER=1 14 group by a.instance_number, 15 TO_CHAR(a.sample_time, 'yyyy-mm-dd hh24:mi:ss'), b.event_name, 16 17 a.sql_id order by 1, 2; 18 INSTANCE_NUMBER TO_CHAR(A.SAMPLE_TIME,'YYYY-MM-DDHH24: EVENT_NAME SQL_ID COUNT(1) --------------- -------------------------------------- ---------------------------------------- -------------------------- ---------- 1 2022-11-18 23:27:59 gc buffer busy acquire 070x0h38a9wdj 2 1 2022-11-18 23:27:59 gc buffer busy acquire 14tfmqd4n7nm5 2 1 2022-11-18 23:27:59 gc buffer busy acquire 161cd6mnw8ng6 6 1 2022-11-18 23:27:59 gc buffer busy acquire 311un5s5ug5yz 2 1 2022-11-18 23:27:59 gc buffer busy acquire 41uxzsq73sqya 4 1 2022-11-18 23:27:59 gc buffer busy acquire 42ck1jx53qubd 2 1 2022-11-18 23:27:59 gc buffer busy acquire 44xyf34792k2d 4 1 2022-11-18 23:27:59 gc buffer busy acquire 575n36005r5u3 2 1 2022-11-18 23:27:59 gc buffer busy acquire 5m068a02bdzm3 2 1 2022-11-18 23:27:59 gc buffer busy acquire 66cg7zfgzgkc2 4 1 2022-11-18 23:27:59 gc buffer busy acquire 6pyp8r5zvk2fp 4 1 2022-11-18 23:27:59 gc buffer busy acquire 6tzsyrszbuwj4 4 1 2022-11-18 23:27:59 gc buffer busy acquire 79wbq83vc3ty4 112 1 2022-11-18 23:27:59 gc buffer busy acquire 7wrp7q9027usg 46 1 2022-11-18 23:27:59 gc buffer busy acquire 7y9dhybsfd028 2 1 2022-11-18 23:27:59 gc buffer busy acquire 8hpztpr1vx9gp 2 1 2022-11-18 23:27:59 gc buffer busy acquire 97khkmrn4kp09 4 1 2022-11-18 23:27:59 gc buffer busy acquire atuz9t4aqvdmm 156 1 2022-11-18 23:27:59 gc buffer busy acquire azr364vxnbwm2 4 1 2022-11-18 23:27:59 gc buffer busy acquire cd3dxusq5djgp 2 1 2022-11-18 23:27:59 gc buffer busy acquire cmnp0rx2k4yz3 6 1 2022-11-18 23:27:59 gc buffer busy acquire d1sh0yu458vfr 2 1 2022-11-18 23:27:59 gc buffer busy acquire d93suk0kjny04 6 1 2022-11-18 23:27:59 gc buffer busy acquire dpsxw3rkuvkqq 4 1 2022-11-18 23:27:59 gc buffer busy acquire f88a91rzaaj5t 2 1 2022-11-18 23:27:59 gc buffer busy acquire fdabpgn8h14ba 4 1 2022-11-18 23:27:59 gc buffer busy acquire fjqnh3937fk06 2 1 2022-11-18 23:27:59 gc buffer busy acquire fzcvkj0mp6wmc 4 1 2022-11-18 23:27:59 gc buffer busy acquire gh6t7aztdn8qu 2 1 2022-11-18 23:27:59 gc buffer busy acquire 2 1 2022-11-18 23:27:59 gc current block busy 5m068a02bdzm3 4 1 2022-11-18 23:27:59 gc current request 00pkrmq6sxw9r 2 1 2022-11-18 23:27:59 gc current request 038d01h60v2zt 2 1 2022-11-18 23:27:59 gc current request 062j8smyzkhwh 2 1 2022-11-18 23:27:59 gc current request 070x0h38a9wdj 2 1 2022-11-18 23:27:59 gc current request 09xfwud5jxbw8 2 1 2022-11-18 23:27:59 gc current request 0dcjdxx0n8cw6 2 1 2022-11-18 23:27:59 gc current request 0fkgdvsz385af 2 1 2022-11-18 23:27:59 gc current request 161cd6mnw8ng6 4 1 2022-11-18 23:27:59 gc current request 17n64vj7dcbyt 6 1 2022-11-18 23:27:59 gc current request 18ard9mx5qvng 2 1 2022-11-18 23:27:59 gc current request 1abtyr0fgr9df 2 1 2022-11-18 23:27:59 gc current request 1qqvtj9u6kyuq 4 1 2022-11-18 23:27:59 gc current request 1qyp832vkzmdy 2 1 2022-11-18 23:27:59 gc current request 1vukxmd490wqd 2 1 2022-11-18 23:27:59 gc current request 2czdn0wnnpjg2 2 1 2022-11-18 23:27:59 gc current request 2gw2xmvr7vtv1 2 1 2022-11-18 23:27:59 gc current request 3fgygmw80svq4 2 1 2022-11-18 23:27:59 gc current request 3gh06zjqtdu2m 8 1 2022-11-18 23:27:59 gc current request 3nht95b2va65x 6 1 2022-11-18 23:27:59 gc current request 3uz6p5nqf6304 4 1 2022-11-18 23:27:59 gc current request 3z3fgx3y6k3gv 2 1 2022-11-18 23:27:59 gc current request 44xyf34792k2d 4 1 2022-11-18 23:27:59 gc current request 4hzdv022v5cp6 6 1 2022-11-18 23:27:59 gc current request 4ncqpqyqhzw7m 2 1 2022-11-18 23:27:59 gc current request 53cz8x8znkmcv 6 1 2022-11-18 23:27:59 gc current request 5amw15qxajduk 2 1 2022-11-18 23:27:59 gc current request 5b4hjw5y8c3pn 6 1 2022-11-18 23:27:59 gc current request 5ktnsgz6zt67h 2 1 2022-11-18 23:27:59 gc current request 5m068a02bdzm3 8 1 2022-11-18 23:27:59 gc current request 65pm9rq1x691x 2 1 2022-11-18 23:27:59 gc current request 667z1rnng9pub 2 1 2022-11-18 23:27:59 gc current request 66cg7zfgzgkc2 6 1 2022-11-18 23:27:59 gc current request 6d7wmntzx7x5b 2 1 2022-11-18 23:27:59 gc current request 6tfq2fcm1rvqy 2 1 2022-11-18 23:27:59 gc current request 6ttzmtxpcx7tw 2 1 2022-11-18 23:27:59 gc current request 6tzsyrszbuwj4 20 1 2022-11-18 23:27:59 gc current request 737yqgwjnvxxf 6 1 2022-11-18 23:27:59 gc current request 79wbq83vc3ty4 8 1 2022-11-18 23:27:59 gc current request 7c2tgwdf07av1 4 1 2022-11-18 23:27:59 gc current request 7fths5kjcw7za 2 1 2022-11-18 23:27:59 gc current request 7jb8htya1uumm 4 1 2022-11-18 23:27:59 gc current request 7rh4jsd9pt3tn 6 1 2022-11-18 23:27:59 gc current request 7wrp7q9027usg 10 1 2022-11-18 23:27:59 gc current request 7y9dhybsfd028 4 1 2022-11-18 23:27:59 gc current request 86369w0pb0rhc 2 1 2022-11-18 23:27:59 gc current request 8hpztpr1vx9gp 16 1 2022-11-18 23:27:59 gc current request 8jbd7zcqr8kzg 8 1 2022-11-18 23:27:59 gc current request 8uz1jman9rqf2 4 1 2022-11-18 23:27:59 gc current request 94fr6pbh6yd6q 4 1 2022-11-18 23:27:59 gc current request 94z0rmj029ad9 4 1 2022-11-18 23:27:59 gc current request 97khkmrn4kp09 10 1 2022-11-18 23:27:59 gc current request 9cbxmhp0r4jqg 6 1 2022-11-18 23:27:59 gc current request 9xdd22c1guxsj 2 1 2022-11-18 23:27:59 gc current request avph9wxh79p42 4 1 2022-11-18 23:27:59 gc current request aw43z1x20qp6w 2 1 2022-11-18 23:27:59 gc current request azr364vxnbwm2 20 1 2022-11-18 23:27:59 gc current request b36hnuv6fzt97 4 1 2022-11-18 23:27:59 gc current request cc6dknr8grn7s 4 1 2022-11-18 23:27:59 gc current request chrhs04u5yd9p 6 1 2022-11-18 23:27:59 gc current request cmnp0rx2k4yz3 2 1 2022-11-18 23:27:59 gc current request cz5q8x68gg32j 12 1 2022-11-18 23:27:59 gc current request d0z8qt8nas9vu 2 1 2022-11-18 23:27:59 gc current request d93suk0kjny04 20 1 2022-11-18 23:27:59 gc current request f7f8g58wgmjpf 2 1 2022-11-18 23:27:59 gc current request f88a91rzaaj5t 10 1 2022-11-18 23:27:59 gc current request f94mgcvs2as5b 2 1 2022-11-18 23:27:59 gc current request fav0q2jkywa7v 2 1 2022-11-18 23:27:59 gc current request fdabpgn8h14ba 10 1 2022-11-18 23:27:59 gc current request fjqnh3937fk06 2 1 2022-11-18 23:27:59 gc current request fvdyxajud739z 8 1 2022-11-18 23:27:59 gc current request fzcvkj0mp6wmc 22 1 2022-11-18 23:27:59 gc current request g97hgk75m2cd0 8 1 2022-11-18 23:27:59 gc current request gnmuju4p9bumr 2 |
对于上述几个top SQL的 gc等待,让客户查询了相关执行计划,plan hash value并没有改变过,跟ash数据表现一致。
到这里就比较有意思了,既然并非SQL本身性能问题,为什么会突然出现400个gc 等待呢?(因为如果是SQL性能问题,通常是逐步叠加等待,并非突然一下就会涌现400个出来)。
分析到这里,我们认为有几个可能性:
1、在23:27分这个时间点,集群心跳有问题,导致gc 操作受到较大影响
2、数据库实例节点1,或者节点2 在23:27:59之前出现了异常,比如核心进程hung,实例hung,挂起等状态,进而影响数据库gc
3、数据库实例1,或者实例2 在23:27:59之前cpu资源异常,如系统cpu 100%等情况,导致lgwr,lms等进程无法获取cpu资源,进而导致该问题。
对于第一个可能性,让客户检查了zabbix监控,并没有发现异常,只是从监控来看,心跳流量较高,高峰期维持在60-70Mb/s,对于比较老旧的小型机环境来讲,相对有一些偏高,但还不不足以产生这么大的影响。
同样监控也发现cpu 资源相对是正常的,因此我们可以排除第1,3 两种可能性原因。
带着这种疑问和猜测,远程了客户环境,检查了节点2 trace目录,发现在当天晚上确实产生了一个diag dump,找了如下内容:
1 2 3 4 5 6 7 8 9 10 |
*** 2022-11-18 23:25:09.055 HM: Short Stack of immediate waiter session ID 12642, OSPID 16261 of hang ID 67 *** 2022-11-18 23:25:09.191 Short stack dump: ksedsts()+592<-ksdxfstk()+48<-ksdxcb()+3376<-sspuser()+688<-<kernel><-_pw_wait()+48<-pw_wait()+112<-sskgpwwait()+432<-skgpwwait()+336<-ksliwat()+3552<-kslwaitctx()+304<-kcrfws()+864<-kcbchg1_main()+26976<-kcbchg1()+352<-ktubgn()+2224<-knledir()+1424<-knleEva()+160<-evaopn2()+1792<-qesltcEvalOutofLineCols()+1568<-qesltcBeforeRowProcessing()+1184<-qerltcNoKdtBufferedInsRowCBK()+464<-qerltcSingleRowLoad()+640<-qerltcFetch()+720<-insexe()+1600<-opiexe()+16336<-kpoal8()+4832<-opiodr()+1920<-ttcpip()+1744<-opitsk()+2992<-opiino()+1696<-opiodr()+1920<-opidrv()+1616<-sou2o()+256<-opimai_real()+624<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80 HM: Short Stack of root session ID 4510, OSPID 3151 of hang ID 67 Short stack dump: ksedsts()+592<-ksdxfstk()+48<-ksdxcb()+3376<-sspuser()+688<-<kernel><-_read_sys()+48<-_read()+224<-$cold_skgfqio()+880<-ksfd_skgfqio()+416<-ksfd_io()+1184<-ksfdread()+992<-kfk_ufs_sync_io()+1568<-kfk_submit_ufs_io()+416<-kfk_submit_io()+224<-kfk_io1()+928<-kfkRequest()+48<-kfk_transitIO()+2608<-kfioSubmitIO()+3248<-kfioRequestPriv()+208<-kfioRequest()+880<-ksfd_kfioRequest()+800<-ksfd_osmio()+624<-$cold_ksfd_io()+464<-ksfdread()+992<-kccrbp()+736<-kccbmp_get()+608<-kccsed_rbl()+432<-kccocx()+2016<-kcc_begin_txn_internal()+272<-kcrfwl()+544<-ksbabs()+1520<-ksbrdp()+2832<-opirip()+1312<-opidrv()+1152<-sou2o()+256<-opimai_real()+352<-ssthrdmain()+608<-main()+336<-main_opd_entry()+80 |
可以看到在23:25:09分,显示节点2 的Session 4510 会话出于hang的状态。那么该会话是什么呢?
Blocking Sid (Inst) | % Activity | Event Caused | % Event | User | Program | # Samples Active | XIDs |
---|---|---|---|---|---|---|---|
4510, 1( 2) | 57.91 | log file switch (checkpoint incomplete) | 54.11 | SYS | oracle@fdcdb2 (LGWR) | 41/120 [ 34%] | 0 |
4510, 1( 2)
|
57.91404374801865857524568633666953489425
|
log file sync | 3.77 |
SYS
|
oracle@fdcdb2 (LGWR)
|
41/120 [ 34%]
|
0
|
6953, 707( 2) | 2.19 | enq: SQ – contention | 2.19 | FDCMGR | JDBC Thin Client | 7/120 [ 6%] | 0 |
3771,39737( 2) | 1.71 | buffer busy waits | 1.71 | FDCMGR | JDBC Thin Client | 7/120 [ 6%] | 0 |
从ASH 报告我们可以得知4510 session 就是我们的LGWR 进程。可见在23:25:09 节点2 lgwr进程就出现了异常。
那么这就能解释通了。。。
我们继续带着疑问分析这个问题,因为过一段时间就自动恢复了,我始终怀疑是应用可能有些不正常,因此让客户检查了之前提到的2个TOP gc 等待的SQL,发现该sql在最近一周的执行计划,分别是13号,16号,18号。 其中这3个时间点都是业务出问题的时间点。唯一不同的是:
13号sql执行次数是9w次,16号是15w次,18号是17w次。根据用户的反馈,13号和16号也出现了类似问题,但是影响要比18号要小,尤其是13号,影响比较短暂。
结合实际情况和现象来看,我认为问题本身还是出现了这个业务模块上,可能是非固定时间执行,当业务量过大时,进而产生了大量gc,最终出现了该问题;实际上从统计来看,随着执行次数增加,sql的单次平均执行时间逐渐拉长。
最后,不得不说,对于这个比较大型的业务系统来讲,单个节点Session在3800-4000左右,每秒redo 在30M/s(单节点). 这么古老的小型机改换了。。。。
本以为故事即将终结,今天客户反馈早上又出现了该问题,虽然影响不大,仍然需要再次分析,从用户提供的diag trace文件来看,跟之前的问题看上去有一些不同:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
------------------------------------------------------------------------------- Chain 1: ------------------------------------------------------------------------------- Oracle session identified by: { instance: 1 (xxxxx.xxxxx1) os id: 5621 process id: 3840, oracle@xxxxx1 session id: 17 session serial #: 62917 } is waiting for 'log file switch (checkpoint incomplete)' with wait info: { time in wait: 17.585836 sec (last interval) time in wait: 44.028985 sec (total) timeout after: never wait id: 13 blocking: 0 sessions wait history: * time between current wait and wait #1: 0.000000 sec 1. event: 'latch: redo writing' time waited: 0.008565 sec wait id: 19 p1: 'address'=0xc000000087dfeb48 p2: 'number'=0xcf p3: 'tries'=0x0 * time between wait #1 and #2: 0.000000 sec 2. event: 'log file switch (checkpoint incomplete)' time waited: 3.465187 sec (last interval) time waited: 26.434584 sec (total) wait id: 13 * time between wait #2 and #3: 0.000000 sec 3. event: 'latch: redo writing' time waited: 0.008045 sec wait id: 18 p1: 'address'=0xc000000087dfeb48 p2: 'number'=0xcf p3: 'tries'=0x0 } and is blocked by => Oracle session identified by: { instance: 1 (xxxxx.xxxxx1) os id: 26277 process id: 20, oracle@xxxxx1 (LGWR) session id: 4510 session serial #: 1 } which is waiting for 'control file sequential read' with wait info: { p1: 'file#'=0x1 p2: 'block#'=0x1 p3: 'blocks'=0x1 time in wait: 0.000000 sec timeout after: never wait id: 2047504176 blocking: 1249 sessions wait history: * time between current wait and wait #1: 0.000013 sec 1. event: 'control file sequential read' time waited: 0.000157 sec wait id: 2047504175 p1: 'file#'=0x0 p2: 'block#'=0x1 p3: 'blocks'=0x1 * time between wait #1 and #2: 0.000074 sec 2. event: 'KSV master wait' time waited: 0.000013 sec wait id: 2047504174 * time between wait #2 and #3: 0.000034 sec 3. event: 'control file sequential read' time waited: 0.000242 sec wait id: 2047504173 p1: 'file#'=0x0 p2: 'block#'=0x2e p3: 'blocks'=0x1 } Chain 1 Signature: 'control file sequential read'<='log file switch (checkpoint incomplete)' Chain 1 Signature Hash: 0xf23230fd |
从hung 分析结果来看,所有会话几乎均被lgwr进程阻塞,被阻塞会话超过1200个。其中lgwr在等待control file sequential read。似乎有一些奇怪?
由于是同一个trace文件,可以看到上次问题也就18号故障时间点的情况:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
*** 2022-11-18 23:24:27.017 =============================================================================== END OF HANG ANALYSIS =============================================================================== *** 2022-11-18 23:25:08.510 Suspected Hangs in the System Root Chain Total Hang Hang Hang Inst Root #hung #hung Hang Hang Resolution ID Type Status Num Sess Sess Sess Conf Span Action ----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- 67 HANG VALID 2 4510 4 1446 HIGH GLOBAL Terminate Process inst# SessId Ser# OSPID Event ----- ------ ----- --------- ----- 1 54 16663 9690 enq: TX - index contention 1 6946 47773 9701 gc current request 2 12642 11983 16261 log file switch (checkpoint incomplete) 2 4510 1 3151 control file sequential read ...... *** 2022-11-18 23:25:34.711 Suspected Hangs in the System Root Chain Total Hang Hang Hang Inst Root #hung #hung Hang Hang Resolution ID Type Status Num Sess Sess Sess Conf Span Action ----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- 74 HANG VALID 2 4510 4 2402 HIGH GLOBAL Terminate Process inst# SessId Ser# OSPID Event ----- ------ ----- --------- ----- 1 11 15447 18842 enq: TX - index contention 1 4763 31315 10670 gc current request 2 11642 39 17120 log file switch (checkpoint incomplete) 2 4510 1 3151 not in wait *** 2022-11-18 23:25:41.065 HM: Matched Hang ID=71 first occurred 11/18/2022 23:25:07, matched 1 times 2/345/44037/latch: redo writing <- 2/228/36601/buffer busy waits *** 2022-11-18 23:25:41.065 HM: Matched Hang ID=72 first occurred 11/18/2022 23:25:07, matched 1 times 2/17224/35815/latch: redo writing <- 2/12442/46953/buffer busy waits |
可以看到lgwr进程(session 4510) 也是在等待control file sequential read或者就是not in wait。可见本质上这几次问题是一样的。
此时怀疑检查点方面可能出现了问题,导致长时间没有触发,进一步分析之前的awr发现部分数据确实有一些异常:
1 2 |
physical writes from cache 20,738,580 5,733.14 6.66 physical writes non checkpoint 17,877,771 4,942.28 5.74 |
其中physical writes from cache – physical writes not checkpoint 的部分就是Oracle 增量检查点触发的dbwr physical writes操作。由此可见,增量检查点触发的写比例非常低,也该也就15%。 换句话讲其他85%可能就是db cache自身写脏等操作触发的。
Oracle从 8i引入了增量检查点,10g 版本引入了自动调节增量检查点机制,主要有如下几个相关参数:
1 2 3 4 5 6 7 8 9 10 11 12 |
NAME VALUE DESCRIB -------------------------------------------------- -------------------- ------------------------------------------------------------ _selftune_checkpoint_write_pct 3 Percentage of total physical i/os for self-tune ckpt _log_checkpoint_recovery_check 0 # redo blocks to verify after checkpoint log_checkpoint_interval 0 # redo blocks checkpoint threshold log_checkpoint_timeout 1800 Maximum time interval between checkpoints in seconds _gc_global_checkpoint_scn TRUE if TRUE, enable global checkpoint scn _disable_incremental_checkpoints FALSE Disable incremental checkpoints for thread recovery _disable_selftune_checkpointing FALSE Disable self-tune checkpointing _selftune_checkpointing_lag 300 Self-tune checkpointing lag the tail of the redo log log_checkpoints_to_alert FALSE log checkpoint begin/end to alert file _kdli_checkpoint_flush FALSE do not invalidate cache buffers after write |
Oracle 10g版本开始,默认启用了增量检查点自动调节机制。除了上述跟dbwr写相关的参数之外,还有跟redo log size相关的参数:
1 2 3 |
NAME VALUE DESCRIB -------------------------------------------------- -------------------- ------------------------------------------------------------ _target_rba_max_lag_percentage 81 target rba max log lag percentage |
该参数在11.2版本中默认值为81%,在10g版本中为90%。 该参数大致意思指最大限制情况下,当所有redo log大小写满到其81%的情况下才触发检查点。从mos上文章也看到了一些描述,某些特定情况下增量检查点机制是有待完善的,与cpu_count,log_buffer大小,redo log size,以及mttr有关。
对于用户这个问题,我们建议用户通过如下2个方式调整之后进行观察:
1、log_buffer 调整为300m
2、fast_start_mttr_target 设置为600秒
Leave a Reply
You must be logged in to post a comment.