Parallel Query 导致的ORA-04031
本站文章除注明转载外,均为本站原创: 转载自love wife love life —Roger的Oracle/MySQL/PostgreSQL数据恢复博客
本文链接地址: Parallel Query 导致的ORA-04031
一个朋友遇到ORA-04031问题。虽然这个错误是非常常见的,然而这里的Case 也有点让人为之震惊!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
Tue Aug 26 11:51:13 2014 Errors in file /oracle/app/oracle/diag/rdbms/xx/xx1/trace/xx1_p485_28873.trc (incident=1589637) ORA-04031: 无法分配 32792 字节的共享内存 ("shared pool","unknown object","sga heap(2,0)","PX msg pool") Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Tue Aug 26 11:51:15 2014 。。。。。。。。 DDE: Problem Key 'ORA 4031' was completely flood controlled (0x6) Further messages for this problem key will be suppressed for up to 10 minutes Tue Aug 26 11:57:12 2014 PMON failed to acquire latch, see PMON dump Tue Aug 26 11:57:15 2014 DIAG (ospid: 14345) has not called a wait for 49 secs. Tue Aug 26 11:57:58 2014 Errors in file /oracle/app/oracle/diag/rdbms/xx/xx1/trace/xx1_qmnc_15686.trc (incident=1608420) ORA-00445: background process "q00T" did not start after 120 seconds Tue Aug 26 11:58:17 2014 。。。。。。。。 Tue Aug 26 12:07:31 2014 WARNING: inbound connection timed out (ORA-3136) Tue Aug 26 12:07:31 2014 WARNING: inbound connection timed out (ORA-3136) Tue Aug 26 12:07:46 2014 PMON failed to acquire latch, see PMON dump |
上述信息为Rac 节点1的信息,对于节点2,信息也基本类似,这里就不贴了。从日志来看,是p621出异常,根据这一点
我们可以判断为是并行进程出现了异常,接着从下面的ORA-04031错误也可以证实这一点。
ORA-04031: 无法分配 32792 字节的共享内存 (“shared pool”,”unknown object”,”sga heap(2,0)”,”PX msg pool”)
从这个错误来看,可以判断P621进程无法分别32792 byte的内存了。这部分内存是需要为Px msg pool进程分配。
这里首先说说PX msg pool是干什么的?
Px msg pool其实是用于Parallel 操作的。通常情况下这部分内存区域应该是从large pool中分配。然而很多业务系统中
由于设置不到,导致parallel 操作仍然使用shared pool的memory。这一点,我们稍后讨论。
接着我们看后面的日子发现PMON进程都无法获得latch了,然后这个时候系统已经不行了。 很明显这个问题的根源就是前面
的ORA-04031错误,这里我们需要分析下trace,来确认下,为什么进程会报错。
该trace文件(xx1_p485_28873.trc)的内容如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
Allocation Request Summary Informaton ===================================== Current information setting: 04014fff SGA Heap Dump Interval=3600 seconds Dump Interval=300 seconds Last Dump Time=08/26/2014 11:51:12 Dump Count=1 Allocation request for: PX msg pool Heap: c00000012fe3ae08, size: 32792 ****************************************************** HEAP DUMP heap name="sga heap(2,0)" desc=c00000012fe3ae08 extent sz=0xfe0 alt=248 het=32767 rec=9 flg=-126 opc=0 parent=0000000000000000 owner=0000000000000000 nex=0000000000000000 xsz=0x10000000 heap=0000000000000000 fl2=0x20, nex=0000000000000000, dsxvers=1, dsxflg=0x0 dsx first ext=0x30000000 latch set 2 of 2 durations disabled for this heap reserved granules for root 0 (granule size 268435456) |
可以看到,PX msg pool请求获得32792 byte的内存大小,从什么地方申请呢? heap 2,也就是shared pool的第2个subpool。
但是却无法分配这么多,以至于报错,下面我们来看下subpool 2的实际情况:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 |
Memory Utilization of Subpool 2 ================================ Allocation Name Size ___________________________ ____________ "free memory " 1431685456 ++++整个subpool 2 还有1.4G free memory左右 "miscellaneous " 1680 "vem_user_actlog1 " 0 "SWEEPERR " 0 "File Space Usage " 112144 "trace_knlasg " 0 "ges big msg buffers " 56801208 "dsktab_kfgsg " 57128 "hot latch diagnostics " 0 "gcs res hash bucket " 4194304 "Wait event pointers " 0 "gcs res latch table " 0 "file state object " 4698096 "vips_package_size " 0 "SEQ S.O. " 1057920 。。。。 "gcs resources " 223317648 "gc lock state " 0 "kglsim alloc latch area " 0 "gcs affinity " 0 "gcs opaque in " 49248 "gcs pnp histo " 0 "ges resource " 605005184 "knlp free small rcv buf " 0 "SQLA " 280160 "PRTI " 0 "SQLP " 53608 "SQLK " 28336 "gcs dynamic r " 2556288 "work area tab " 236640 "plwpil:wa " 0 "kebm run-once actions " 0 "grptab_kfgsg " 0 "AW SGA " 0 "dbktb: trace buffer " 53673984 "v_swperrcount " 0 "ksunfy : SSO free list " 27834240 "ksxp IPC state object " 346448 "ksdhng: cache history " 0 "DFW_CONFIG_CAPTURE " 0 "enqueue resources " 862232 "kzekm heap descriptor " 0 "kelt translation table " 0 "IPS_PACKAGE_UNPACK_HISTOR" 0 "kglsim hash table bkts " 2097152 "KSXR global channels " 0 "ges enqueues " 344980144 "trace buf hdr xtend " 2080 "KSXR receive buffers " 1061888 "kglsim Java timestamps " 131072 "VM OSD context " 0 "KCB Table Scan Buffer " 4194816 "knlp receive buffers " 0 "PLS chunk " 0 "knlu_init_btree:init " 0 "KEWS aggregation objs " 0 "DDE_USER_ACTION_DEF " 0 "keswxNotify:tabPlans " 131072 "EM_USER_ACTIVITY " 0 "KSXR large reply queue " 168296 "event-class map " 0 "KSQ event description " 0 "KSXP CONN Stats buffer " 728432 "KSXP CTX Stats buffer " 578128 "KSXP PORT Stats buffer " 142768 "kglsim alloc latches " 0 "kglsim used list arr " 0 "kglsim free obj list " 0 "KTCN global operators " 0 "KESTB existence bitvec " 131072 "KFD extent enqueue obj " 42792 "ksdhng: el wtr cache " 401280 "ksb cic process list " 0 "gcs shadows " 130592800 "value block f " 43832 "ksdhng: blkers cache " 72960 "FileOpenBlock " 255011680 "qmtb_init_xxta " 0 "sort segment handle " 54936 "latchnum to latch map " 0 "LISTEN ADDRESS ENTRY " 0 "vips_pkg_file " 0 "PX msg pool " 47109312 +++++ 当前PX msg pool已经分配了47M左右 。。。。。。 "Checkpoint queue " 116765184 "sim segment hits " 72960 "sim lru segments " 72960 "pesom.c:Global " 0 "peshm.c:Global " 0 "paralltp txn r " 22476896 "ASH buffers " 119537664 "ASM KFFD SO " 0 "statement queuing " 0 "v_actinc " 0 "message pool freequeue " 523240 "ADR_CONTROL " 0 "vshowincb " 0 "vproblem2 " 0 "PMON blockers " 0 "procs: ksunfy " 6408000 "sga listelement " 0 "primem_kfmdsg " 0 "agent name " 0 "SYSTEM PARAMETERS " 232944 "Online xxtafile Move sess" 0 "sched job queue " 0 "vproblem " 0 "object queue hash buckets" 8404992 "object queue hash table d" 229824 "obj stats allocation chun" 1290240 "qmnsf_heap " 0 "sim cache nbufs " 0 "qmu subheap " 53976 "MTTR advisory context " 0 "LGWR-network Server info " 34032 "LGWR debug module memory " 0 "logout storm management " 0 "log_simultaneous_copies " 49248 "sys event stats " 552960 "parameter handle " 4273688 "Parameter Handle " 0 "name-service recovery " 80776 "channel handle " 48360 "parameter blocks " 22016 "qmrcInitSGA:qmrcSGA " 0 "PARAMETER ENTRY " 0 "ASM map operations " 30552 "Security Class " 33280 "GCR " 0 "post/wait queues " 55872 "bloom filter " 0 "Auto tune undo info " 0 "pesldT03_Instance " 33848 "cp pool array " 0 "cp cmon array " 0 "UNDO INFO " 0 "Log Archive Config Contex" 0 "state objects " 107440104 "ArchLog Dest Array " 22784 "parallel kcbibr " 0 "archive_lag_target " 0 "poisoned memory " 32768 "name-service request " 0 "latch recovery structures" 0 "latches for protecting fs" 0 "latches for protecting ls" 0 "Label Cache Heap descrptr" 0 "cp connection " 0 "sched job slv " 0 "bt_subentry " 0 "Saved job ids " 0 "object level stat table " 0 "pso tbs: ksunfy " 780000 "dummy " 0 "qm_init_sga:qmdpsg " 0 "analytic workspace " 0 "sim trace buf " 0 "qm_init_sga:oidctx " 0 "Sort Segment " 16852160 "PRESENTATION ENTRY " 0 "cp srv array " 0 "recov_kgqbtctx " 26512 "dbkea sga " 0 "pesom.c:latch array " 0 "Banner Storage " 0 "qmps connections " 65280 "quiescing session " 0 "qm_init_uga:qmsg " 0 "enqueue " 0 "buffer handles " 5472000 "process group array " 69776 "DML lock " 7066304 "property service SO " 0 "broker globals " 0 "sga node map " 0 "prmtzdini tz region " 801416 "VIEW " 0 "savepoints " 0 "Monitor Modification Hash" 0 "RESUMABLE NAME " 0 "obj stats hash table " 0 "max allowable # log files" 506400 "OS proc request holder " 293424 "message pool context area" 22400 "media recovery state obje" 0 "db_files " 1783968 "base-kglhdusr " 1218016 "constraints " 0 "ASM scan context " 0 "PX server msg stats " 0 "PX server deq stats " 0 "cinfo_kfnsg " 0 "cco_subnode " 0 "cco_pubnode " 0 "KQR X PO " 50414728 "KQR L PO " 5977688 "parameter table block " 31158248 "KQR M SO " 192808 "KQR L SO " 128104 "KQR S SO " 90928 "KSFV SGA " 0 "KSIR SGA " 0 "ASM rollback operations " 0 "kccsgehtt " 0 "kglsgivl2 " 28160 "plugin xxtafile array " 36016 "plwxx:PLW_STR_NEW_RVAL " 0 "plwspv:PLW_STR_NEW_VAL " 0 "lm heartbeat " 0 "KGSK sga " 0 "BRANCH TABLE SEGMENTED AR" 105288 "ksv slave " 0 "RVWR post requested array" 0 "parameter value memory " 0 "RTA Bounxxry Info thread " 71808 "RTA read control file ind" 0 "k2q_lock " 4606648 "ASM extent pointer array " 318197080 。。。。。。。 "kglsim object batch " 338733696 "ktlbk state objects " 5417280 。。。。。 "kglsim heap " 197796864 "ksz parent " 204960 "kgnfs authp " 16464 "event statistics per sess" 29840640 "event statistics ptr arra" 36480 "event list to post commit" 0 "KFG SO child " 0 "KGKP randnum " 40000 Library Cache Dump |
可以看到gcs/ges resource/enqueue这部分内存都消耗了1.4g 左右,还是比较大的,另外ASM extent pointer array
也有300多m。这个可能跟asm的extent分配有关系,对于DW缓解,建议不要使用默认的uniform AU 1m,太小了。
总体来看,从这个subpool的使用情况来看,无法看出是不是碎片严重了。但是从dump下面的library 数据来看,也是
能看出一些端倪的,如下:
1 2 3 4 5 6 7 8 9 |
LIBRARY CACHE STATISTICS: namespace gets hit ratio pins hit ratio reloads invalids -------------- --------- --------- --------- --------- ---------- ---------- SQL AREA 325150175 0.984 2560976170 0.992 2656085 3942426 TABLE/PROCEDURE 73717472 0.996 786560445 0.998 481299 1362 BODY 11738443 1.000 86301250 1.000 521 0 TRIGGER 3247968 1.000 3248823 0.999 153 2 INDEX 576294 0.993 577639 0.974 8448 0 ........ |
大家可以看到,SQL area 这里的reloads和invalids太高了。说明这个时间点shared pool的可用内存可能不多。虽然
从前面的dump看free memory还不少。 我们需要明白一点,Oracle消耗shared pool的内存,是需要连续的一段内存区域。
从这里来看如果如不是shared pool碎片严重那么可能就是有大的SQL操作。我们看该进程的信息发现该进程一直在等待
某个latch Child parallel query alloc buffer,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
SO: 0xc000001842ba9750, type: 2, owner: 0x0000000000000000, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc000001842ba9750, name=process, file=ksu.h LINE:12616 ID:, pg=0 (process) Oracle pid:664, ser:2, calls cur/top: 0xc0000015e1989c10/0xc000001711b743a8 flags : (0x0) - flags2: (0x0), flags3: (0x10) intr error: -2147473260, call error: 0, sess error: 0, txn error 0 intr queue: 2147494036 ksudlp FALSE at location: 0 (post info) last post received: 2147494036 136 12 last post received-location: ksu.h LINE:13927 ID:ksusig last process to post me: c000001832b0b620 242 0 last post sent: 0 0 356 last post sent-location: kxfp.h LINE:4561 ID:kxfprienq: slave last process posted by me: c000001832b0b620 242 0 (latch info) wait_event=0 bits=50 holding (efd=10) c000001804617de8 Child parallel query alloc buffer level=6 child#=23 Location from where latch is held: kxfpb.h LINE:496 ID:kxfpbalo: allocate a buffer Context saved from call: 13835058152010268280 state=busy [holder orapid=664] wlstate=free [value=0] waiters [orapid (seconds since: put on list, posted, alive check)]: 746 (456, 1409025072, 2) 1068 (456, 1409025072, 3) 747 (456, 1409025072, 2) 857 (456, 1409025072, 1) 1067 (456, 1409025072, 3) 748 (456, 1409025072, 1) 983 (434, 1409025072, 4) 1070 (384, 1409025072, 3) 749 (384, 1409025072, 1) 1069 (384, 1409025072, 2) 912 (346, 1409025072, 4) 662 (323, 1409025072, 3) 871 (308, 1409025072, 4) 906 (270, 1409025072, 4) 735 (209, 1409025072, 4) 786 (157, 1409025072, 1) 972 (74, 1409025072, 1) waiter count=17 holding (efd=10) c00000168571bdd0 Child process queue reference level=4 child#=10989 Location from where latch is held: kxfp.h LINE:4438 ID:kxfprialo2: process qref child init Context saved from call: 13835058150030153416 state=busy [holder orapid=664] wlstate=free [value=0] Process Group: DEFAULT, pseudo proc: 0xc0000018330dea98 O/S info: user: oracle, term: UNKNOWN, ospid: 28873 OSD pid info: Unix process pid: 28873, image: oracle@xx01 (P485) PSO child state object changes : |
这个latch Child parallel query alloc buffer 也就是在进行Px msg pool 内存分配的时候需要获得的,当内存分配完毕之后latch会立刻
释放,通常情况下,Oracle 对于Latch的申请和释放是非常之快的。很明显这里这个进程由于申请内存失败,导致latch也没有释放,还阻塞
了一堆的会话。
由于朋友这里没有其他的信息,所以很难准确的定位到是不是碎片或者其他的问题。但是从trace 里面也不难发现一些信息。前面提到进程
申请内存,势必就要看看下这个会话是什么操作了,不看不知道,一看吓一跳,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
MERGE INTO xx.xxxx_A A USING (SELECT /*+ parallel(a,512) parallel(b,8) */ :B2 STATIS_MONTH, NVL(B.PROV_ID, '000') PROV_ID, COUNT(DISTINCT A.SERV_NUM) CNT FROM (SELECT DISTINCT SERV_NUM FROM xxx.TDW_XXXXX_MIGU_MEMBER_M A WHERE STATIS_MONTH = :B1 AND MEMBER_LVL = '1' AND MEMBER_STAT IN ('1', '0') AND EXISTS (SELECT /*+ parallel(b,64)*/ 1 FROM xxx.TKR_DEV_XXXXX_CORE_M B WHERE A.SERV_NUM = B.SERV_NUM)) A, xxx.VDW_NUMBER_SEGMENT B WHERE SUBSTR(A.SERV_NUM, 1, 7) = B.SEGMENT(+) GROUP BY B.PROV_ID) B ON (A.STATIS_MONTH = B.STATIS_MONTH AND A.PROV_ID = B.PROV_ID) WHEN MATCHED THEN UPxxTE SET CORE_NRML = CNT |
大家可以看到,这个sql居然开了512个并行,如果把几个表的并行都加起来,高达584个。问题应用这里还不止这一个SQL。
对于并行查询,我们知道,即使你的并行度设置的再高,也不一定能用这么多,因为这是受限制数据库参数的。不过,悲剧的是,
他这里默认参数都是比较大的。cpu_count=114,默认的dop都高达228。parallel_max_server更是高达好几千了。
还有一点,他的参数parallel_force_local是ture,这也导致parallel进程只能在单个实例进行分配。
其实上述信息都不是重要的,更重要也是更致命的一点的是,并行查询使用的Px msg pool居然从shared pool了进行分配了。
我看信息,他这里的large pool设置高达10gb啊,其实是完全没用派上用场。
对于Px msg pool,如果你的数据库PARALLEL_MIN_SERVERS比较大(这里就是比较大),那么当数据库启动时候,可能Px msg pool
就会消耗不少的内存的。
对于Parallel操作的内存消耗Px memory,可以从large pool分配,也可以从shared pool进行分配。这里需要注意,即使你设置了
large pool,如果不满足下面的条件,那么Px memory仍然会从shared pool去分配。
1) parallel_automatic_tuning 参数设置为auto
2) _PX_use_large_pool 隐含参数设置为true
3) 使用ASMM ,即使设置了sga_target 或 memory_target.
很遗憾的是,他这里3点都不满足。
Leave a Reply
You must be logged in to post a comment.