关于parallel rollback的一点总结
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: 关于parallel rollback的一点总结
1 2 3 4 5 6 |
关于parallel rollback,不少人写过相关的文章,今天偶然看到了,也简单的总结一下。 说到parallel rollback,那我们首先就要来看看与其有重大关系的一个参数: fast_start_parallel_rollback 该参数是oracle 8i引入的,详细信息如下表所示: |
1 2 3 4 5 6 7 8 9 10 11 |
Version Parameter Type Modifiable 11.1.0.7 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 11.1.0.6 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 10.2.0.4 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 10.2.0.3 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 10.1.0.5 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 10.1.0.4 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 9.2.0.8 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 9.0.1.4 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) 8.1.7.4 fast_start_parallel_rollback STRING ALTER SYSTEM (IMMEDIATE) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
该参数如下3种属性值: false -- 即是关闭parallel rollback功能 low -- 也是10g的默认值,该值含有是最大的rollback进程为2*cpu_count个 high -- 当设置为该值时,最大的rollback进程为4*cpu_count个 当然其最大值是要受参数parallel_max_servers的影响的,如果是rac环境,那么还跟参数 parallel_threads_per_cpu有关系,这里需要说明一点的是,该参数跟recovery_parallelism不同的, recovery_parallelism参数是指在进行instance crash recovery时的并行恢复进程个数。 关于并行rollback操作,我们可以通过观察几个试图来进行判断其回滚的时间,如下的几个试图就是 我们需要进行查询的: v$fast_start_transactions V$FAST_START_SERVERS 或 x$ktuxe 下面通过例子来进行说明. |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
SQL> show parameter rollback NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ _cleanup_rollback_entries integer 100 _corrupted_rollback_segments string _max_cr_rollbacks integer 0 _offline_rollback_segments string _rollback_segment_count integer 0 _rollback_segment_initial integer 1 _rollback_stopat integer 0 fast_start_parallel_rollback string LOW #### 10g中的默认值 #### rollback_segments string transactions_per_rollback_segment integer 5 SQL> show user USER is "ROGER" SQL> select count(*) from ht1; COUNT(*) ---------- 999 SQL> begin 2 for i in 1 .. 1000 loop 3 insert /*+ append */ 4 into ht1 5 select * from ht1; 6 commit; 7 end loop; 8 end; 9 / begin * ERROR at line 1: ORA-01653: unable to extend table ROGER.HT1 by 1024 in tablespace ROGER ORA-06512: at line 3 SQL> select count(*) from ht1; COUNT(*) ---------- 1022976 ++++++ 在当前session执行全表的delete操作,如下:++++++ SQL> delete from ht1; 1022976 rows deleted. ++++++ 时间大约1分钟左右 ++++++ |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
###### 在另一窗口进行如下查询:###### SQL> show user USER is "SYS" SQL> set lines 200 SQL> select XIDUSN,XIDSLOT,XIDSQN,NAME,START_UBASQN,START_UBAREC 2 from v$transaction; XIDUSN XIDSLOT XIDSQN NAME START_UBASQN START_UBAREC ---------- ---------- ---------- ------------------------- ------------ ------------ 6 31 234 221 1 SQL> select KTUXEUSN, KTUXESLT, KTUXESQN, KTUXECFL, KTUXESIZ, sysdate 2 from x$ktuxe 3 where KTUXEUSN = 6 4 and KTUXESLT = 31; KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- --------- 6 31 234 NONE 38912 07-OCT-11 ++++++ 下面进行rollback time计算 ++++++ SQL> show user USER is "ROGER" SQL> set timing on SQL> rollback; Rollback complete. Elapsed: 00:00:46.53 ++++++ 该rollback操作花费了46.53秒 ++++++ |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
###### 另外窗口 ###### SQL> alter session set nls_date_format='yyyy-mm-dd hh24:mi:ss'; Session altered. Elapsed: 00:00:00.04 SQL> select KTUXEUSN, KTUXESLT, KTUXESQN, KTUXECFL, KTUXESIZ, sysdate 2 from x$ktuxe 3 where KTUXEUSN = 6 4 and KTUXESLT = 31; KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------ ---------- ------------------- 6 31 234 NONE 3616 2011-10-07 06:32:45 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------ ---------- ------------------- 6 31 234 NONE 2479 2011-10-07 06:32:47 Elapsed: 00:00:00.01 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------ ---------- ------------------- 6 31 234 NONE 0 2011-10-07 06:32:52 Elapsed: 00:00:00.01 SQL> select 38912/((3616-2479)/2) from dual; 38912/((3616-2479)/2) --------------------- 68.4467898 Elapsed: 00:00:00.01 通过我计算发现,应该是需要68.44s才能完成rollback操作,但是为什么实际上46.53s就完成了呢? |
1 2 3 4 5 6 7 8 9 10 11 12 13 |
###### 再次进行相同的测试 ###### SQL> delete from ht1; 1022976 rows deleted. Elapsed: 00:01:31.96 SQL> rollback; Rollback complete. Elapsed: 00:00:36.51 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 |
###### 另一session窗口 ###### SQL> select XIDUSN,XIDSLOT,XIDSQN,NAME,START_UBASQN,START_UBAREC from v$transaction; XIDUSN XIDSLOT XIDSQN NAME START_UBASQN START_UBAREC ---------- ---------- ---------- ------------------------- ------------ ------------ 2 13 217 207 55 Elapsed: 00:00:00.01 SQL> select KTUXEUSN, KTUXESLT, KTUXESQN, KTUXECFL, KTUXESIZ, sysdate 2 from x$ktuxe 3 where KTUXEUSN = 2 4 and KTUXESLT = 13; KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 38913 2011-10-07 06:55:36 Elapsed: 00:00:00.01 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 32721 2011-10-07 06:55:54 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 29825 2011-10-07 06:55:55 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 29195 2011-10-07 06:55:57 Elapsed: 00:00:00.01 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 25293 2011-10-07 06:56:00 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 22288 2011-10-07 06:56:02 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 21288 2011-10-07 06:56:04 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 18512 2011-10-07 06:56:05 Elapsed: 00:00:00.01 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 17643 2011-10-07 06:56:07 Elapsed: 00:00:00.00 SQL> / KTUXEUSN KTUXESLT KTUXESQN KTUXECFL KTUXESIZ SYSDATE ---------- ---------- ---------- ------------------------ ---------- ------------------- 2 13 217 NONE 16880 2011-10-07 06:56:08 Elapsed: 00:00:00.00 SQL> select * from v$fast_start_transactions; no rows selected Elapsed: 00:00:00.00 SQL> select * from v$fast_start_servers; no rows selected Elapsed: 00:00:00.00 SQL> / no rows selected Elapsed: 00:00:00.00 根据前面的时间数据,我分别计算出2个值如下1238和1102,然后再根据该值取平均值,如下; SQL> select 1238+1102 from dual; 1238+1102 ---------- 2340 SQL> select 2340/2 from dual; 2340/2 ---------- 1170 SQL> select 38913/1170 from dual; 38913/1170 ---------- 33.2589744 我们来看看此次的rollback时间,如下: SQL> show user USER is "ROGER" SQL> delete from ht1; 1022976 rows deleted. Elapsed: 00:01:31.96 SQL> rollback; Rollback complete. Elapsed: 00:00:36.51 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
可以看到,已经非常的接近了,因为前面的时间计算,其实还存在一定的误差,从数学的角度来说, 要想更加精确,需要把时间扩大,进行多次采集然后计算平均值,那样可能就更加接近实际值了。 最后需要做一点补充的是,关于x$ktuxe大家可以参考我以前的一个笔记,简略信息如下: <div id="B" style="height:auto; width:60%;OVERFLOW-y:auto;border:blue 1px solid;margin:10px"> ... ... ... ... ... ... ktuxc 其实并不神秘,跟一个x$表相关,即如下x$表: x$ktuxe [K]ernel [T]ransaction [U]ndo transaction [E]ntry ... ... ... ... ... ... </div> 我们知道,在平常所遇到的数据库故障中,跟undo相关的太多了,基本上ora-4xxx ~都跟undo有着紧密的联系 比如大家所熟知的ora-4193、ora-4000、ora-4194等等。 这里我主要简单的分析下ktuxe的结构,其实关于ktuxe, 还有一个很重要的地方,那就是无法通过v$tranactions查到的死事务,我们可以通过x$ktuxe来获取。 那么ktuxe,ktuxc结构到底多大呢?换句话说,在整个block中,占据了多少个byte呢? 如下: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |
COMPONEN TYPE DESCRIPTION TYPE_SIZE -------- -------- -------------------------------- ---------- S EWORD EITHER WORD 4 S EB1 EITHER BYTE 1 1 S EB2 EITHER BYTE 2 2 S EB4 EITHER BYTE 4 4 S UWORD UNSIGNED WORD 4 S UB1 UNSIGNED BYTE 1 1 S UB2 UNSIGNED BYTE 2 2 S UB4 UNSIGNED BYTE 4 4 S SWORD SIGNED WORD 4 S SB1 SIGNED BYTE 1 1 S SB2 SIGNED BYTE 2 2 S SB4 SIGNED BYTE 4 4 S BOOLEAN BOOLEAN 4 S FLOAT FLOAT 4 S DOUBLE DOUBLE 8 S SIZE_T SIZE_T 4 S DSIZE_T DSIZE_T 4 S PTR_T PTR_T 4 K KDBA DATABASE BLOCK ADDRESS 4 K KTNO TABLE NUMBER IN CLUSTER 1 K KSCN SYSTEM COMMIT NUMBER 8 K KXID TRANSACTION ID 8 K KUBA UNDO ADDRESS 8 KCB KCBH BLOCK COMMON HEADER 20 KTB KTBIT TRANSACTION VARIABLE HEADER 24 KTB KTBBH TRANSACTION FIXED HEADER 48 KTB KTBBH_BS TRANSACTION BLOCK BITMAP SEGMENT 8 KDB KDBH DATA HEADER 14 KDB KDBT TABLE DIRECTORY ENTRY 4 KTE KTECT EXTENT CONTROL 44 KTE KTECH EXTENT CONTROL 72 KTE KTETB EXTENT TABLE 8 KTS KTSHC SEGMENT HEADER 8 KTS KTSFS SEGMENT FREE SPACE LIST 20 KTS KTSPHW PAGE TABLE SEGMENT HWM 60 KTS KTSPHC PAGE TABLE SEGMENT HEADER 112 KTS KTSPFHC LEVEL 1 BITMAP BLOCK HEADER 184 KTS KTSPSHC LEVEL 2 BITMAP BLOCK HEADER 96 KTS KTSPTHC LEVEL 3 BITMAP BLOCK HEADER 88 KTU KTUBH UNDO HEADER 16 KTU KTUXE UNDO TRANSACTION ENTRY 40 ==> 跟ktuxc相关 KTU KTUXC UNDO TRANSACTION CONTROL 104 ==> 这里是我们这里需要关注的地方(占据了104个byte) KDX KDXCO INDEX HEADER 16 KDX KDXLE INDEX LEAF HEADER 32 KDX KDXBR INDEX BRANCH HEADER 24 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
BBED> p ktuxc struct ktuxc, 104 bytes @4148 -- 通过bbed我们可以发现,确定是占据了104个byte struct ktuxcscn, 8 bytes @4148 ub4 kscnbas @4148 0x001564b5 -- scn值 ub2 kscnwrp @4152 0x0000 struct ktuxcuba, 8 bytes @4156 -- UBA值 (有如下3部分组成) ub4 kubadba @4156 0x00400181 -- dba(转换后为file 1 block 385) ub2 kubaseq @4160 0x0068 ub1 kubarec @4162 0x0f -- Last Entry in UNDO record map sb2 ktuxcflg @4164 1 (KTUXCFSK) -- 表示inactive ub2 ktuxcseq @4166 0x0068 sb2 ktuxcnfb @4168 1 ub4 ktuxcinc @4172 0x00000000 sb2 ktuxcchd @4176 30 sb2 ktuxcctl @4178 22 -- 这里的部分内容 目前我还没搞清楚是啥意思 ub2 ktuxcmgc @4180 0x8002 ub4 ktuxcopt @4188 0x7ffffffe struct ktuxcfbp[0], 12 bytes @4192 struct ktufbuba, 8 bytes @4192 ub4 kubadba @4192 0x00400181 ub2 kubaseq @4196 0x0068 ub1 kubarec @4198 0x0f sb2 ktufbext @4200 2 sb2 ktufbspc @4202 4280 struct ktuxcfbp[1], 12 bytes @4204 struct ktufbuba, 8 bytes @4204 ub4 kubadba @4204 0x00000000 ub2 kubaseq @4208 0x0065 ub1 kubarec @4210 0x02 sb2 ktufbext @4212 5 sb2 ktufbspc @4214 7886 struct ktuxcfbp[2], 12 bytes @4216 struct ktufbuba, 8 bytes @4216 ub4 kubadba @4216 0x00000000 ub2 kubaseq @4220 0x003e ub1 kubarec @4222 0x25 sb2 ktufbext @4224 2 sb2 ktufbspc @4226 464 struct ktuxcfbp[3], 12 bytes @4228 struct ktufbuba, 8 bytes @4228 ub4 kubadba @4228 0x00000000 ub2 kubaseq @4232 0x003e ub1 kubarec @4234 0x08 sb2 ktufbext @4236 2 sb2 ktufbspc @4238 7130 struct ktuxcfbp[4], 12 bytes @4240 struct ktufbuba, 8 bytes @4240 ub4 kubadba @4240 0x00000000 ub2 kubaseq @4244 0x0000 ub1 kubarec @4246 0x00 sb2 ktufbext @4248 0 sb2 ktufbspc @4250 0 |
1 2 3 |
从上面信息,我们可以发现最后一次使用的undo block是 file 1 block 385。下面来看看ktuxe的情况 下面我们来看看x$ktuxe的结构: |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
SQL> desc x$ktuxe Name Null? Type ------------------ -------- ---------------------------- ADDR RAW(4) INDX NUMBER INST_ID NUMBER KTUXEUSN NUMBER -- undo seq number KTUXESLT NUMBER -- slot 即为事务槽号 KTUXESQN NUMBER -- sequence号 KTUXERDBF NUMBER -- 文件号 即为file id KTUXERDBB NUMBER -- block 号 KTUXESCNB NUMBER -- SCN base for prepare/commit KTUXESCNW NUMBER -- SCN wrap for prepare/commit KTUXESTA VARCHAR2(16) -- 事务状态 (我所知道的应该是有3种 action,inaction,dead) KTUXECFL VARCHAR2(24) -- 事务标志 KTUXEUEL NUMBER -- 这里根据我猜测,应该是指用于存放事务的一个列表 KTUXEDDBF NUMBER -- file id KTUXEDDBB NUMBER -- block number KTUXEPUSN NUMBER -- parent usn number KTUXEPSLT NUMBER -- parent solt number KTUXEPSQN NUMBER -- parent seq# KTUXESIZ NUMBER -- 用于该事务有多少个undo block(单位是block) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
这里我只想说明一点的是,注意KTUXESIZ的单位是block,而不是byte。 本文的重点不是来讲述这个x$表的用途,而是想详细说明parallel rollback 的一些东西, 这里有点要说明的是,rollback分为两种,如下: 1. parallel rollback 即并行回滚 2. Serial rollback 即串行回滚 当初始化参数fast_start_parallel_rollback设置为false以后,即为串行回滚,否则为并行回滚。 我们知道并行回滚是oracle 8i引入的一个特性,那么必然就有它的优势,但是我google了一下, 找到了老白的帖子,说某些情况下并行恢复会非常的慢,但是没有说具体是那些情况下,这点较为 遗憾。 我这里由于是单表所以模拟的时候并没有发现parallel rollback进程,虽然说参数是默认值。 总的来说,不管是使用并行回滚还是串行回滚,我们都可以通过上面的方法来进行估算具体的回滚时间。 通过所掌握的知识来看,我猜测可能在如下的情况下parallel rollback可能会更慢: 1. 相关的操作正常情况下走index,而如果是parallel,那么将走全表扫描,所以会更慢; 2. 存在相关争用的情况下。 另外,在某些情况下我们还可以同调整_cleanup_rollback_entries来加快恢复进度。 |
3 Responses to “关于parallel rollback的一点总结”
DATABASE HANG DUE TO PARALLEL TRANSACTION RECOVERY [ID 464246.1]
——————————————————————————–
修改时间 18-OCT-2010 类型 PROBLEM 状态 PUBLISHED
In this Document
Symptoms
Cause
Solution
——————————————————————————–
Applies to:
Oracle Server – Enterprise Edition – Version: 9.2.0.1 to 10.2.0.4 – Release: 9.2 to 10.2
Information in this document applies to any platform.
Symptoms
Database is hanging. Undo tablespace is growing.
The hang encountered because of parallel transaction recovery as the systemstate dump shows the significant waits for “Wait for a undo record” and “Wait for stopper event to be increased”.
Cause
The systemstate dump shows the following waitevents:
SO: 70000008c9de498, type: 4, owner: 70000008c626cc0, flag: INIT/-/-/0x00
(session) sid: 1099 trans: 0, creator: 70000008c626cc0, flag: (51) USR/- BSY/-/-/-/-/-
DID: 0001-0008-00000003, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
waiting for ‘wait for stopper event to be increased’ blocking sess=0x0 seq=82 wait_time=0
seconds since wait started=156
=0, =0, =0
Dumping Session Wait History
for ‘wait for stopper event to be increased’ count=1 wait_time=97720 =0, =0, =0
for ‘wait for stopper event to be increased’ count=1 wait_time=97681 =0, =0, =0
for ‘wait for stopper event to be increased’ count=1 wait_time=97676 =0, =0, =0
for ‘wait for stopper event to be increased’ count=1 wait_time=97676 …
SO: 70000008c9d6a58, type: 4, owner: 70000008c628460, flag: INIT/-/-/0x00
(session) sid: 1087 trans: 0, creator: 70000008c628460, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0001-000E-00000004, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 0, psql: 0, user: 0/SYS
O/S info: user: oracle, term: UNKNOWN, ospid: 2998446, machine: ct-db3
program: oracle@ct-db3 (P000)
waiting for ‘wait for a undo record’ blocking sess=0x0 seq=69 wait_time=0 seconds since wait
started=243 =0, =0, =0
Dumping Session Wait History
for ‘wait for a undo record’ count=1 wait_time=97677 =0, =0, =0
for ‘wait for a undo record’ count=1 wait_time=97675 =0, =0, =0
for ‘wait for a undo record’ count=1 wait_time=97673 =0, =0, =0
for ‘wait for a undo record’ count=1 wait_time=97682 =0, =0, =0
for ‘wait for a undo record’ count=1 wait_time=97679 …
The above 2 wait events are mainly meant for parallel transaction recovery.
Solution
Solution
======
To disable the parallel rollback by setting the following parameter
fast_start_parallel_rollback = false
Explanation
========
Sometimes Parallel Rollback of Large Transaction may become very slow. After killing a large running transaction (either by killing the shadow process or aborting the database) then database seems to hang, or SMON and parallel query servers taking all the available CPU.
In fast-start parallel rollback, the background process SMON acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes.
Fast start parallel rollback is mainly useful when a system has transactions that run a long time before a commit, especially parallel Inserts, Updates, Deletes operations. When SMON discovers that the amount of recovery work is above a certain threshold, it automatically begins parallel rollback by dispersing the work among several parallel processes.
There are cases where parallel transaction recovery is not as fast as serial transaction recovery, because the PQ slaves are interfering with each other. It looks like the changes made by this transaction cannot be recovered in parallel without causing a performance problem. The parallel rollback slave processes are most likely contending for the same resource, which results in even worse rollback performance compared to a serial rollback.
How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active [ID 238507.1]
——————————————————————————–
修改时间 29-APR-2010 类型 PROBLEM 状态 PUBLISHED
Checked for relevance on 18-Sep-2008
Checked for relevance on 29-April-2010
Symptom(s)
~~~~~~~~~~
Parallel Transaction Recovery is taking too long.
You can use V$TRANSACTION USED_UBLK to estimate how long the rollback is going
to take but there is no formula for this. If you shutdown the database after
rollback has started, it will begin where it left off.
You can also look at V$FAST_START_TRANSACTIONS for UNDOBLOCKSDONE
versus UNDOBLOCKSTOTAL.
Change(s)
~~~~~~~~~~
A large transaction got killed or rolled back.
Cause
~~~~~~~
There are cases where parallel transaction recovery is not as fast as serial
transaction recovery, because the pq slaves are interfering with each other.
This depends mainly on the type of changes that need to be made during rollback
and usually may happen when rolling back INDEX Updates in parallel.
Fix
~~~~
Dynamically switch from parallel recovery to serial. If you are in a clustered
environment you will need to do this on all instances at the same time:
1. Find SMON’s Oracle PID:
Example:
SQL> select pid, program from v$process where program like ‘%SMON%’;
PID PROGRAM
———- ————————————————
6 oracle@stsun7 (SMON)
2. Disable SMON transaction cleanup:
SVRMGR> oradebug setorapid
SVRMGR> oradebug event 10513 trace name context forever, level 2
3. Kill the PQ slaves that are doing parallel transaction recovery.
You can check V$FAST_START_SERVERS to find these.
4. Turn off fast_start_parallel_rollback:
alter system set fast_start_parallel_rollback=false;
If SMON is recovering, this command might hang, if it does just control-C out of it. You may need to try this many times to get this to complete (between SMON cycles).
5. Re-enable SMON txn recovery:
SVRMGR> oradebug setorapid
SVRMGR> oradebug event 10513 trace name context off
References
~~~~~~~~~~~
NOTE 144332.1
Parallel Rollback may hang database, Parallel query servers get 100% cpu
这个实验如果添加 insert into select @dblink ,也许就能看到 并行恢复慢了,我也只是猜想!
这里有一个AWR ,和这个有关 http://www.itpub.net/thread-1814543-1-1.html
Leave a Reply
You must be logged in to post a comment.