Improper Linux Semaphore Settings Cause High Sys CPU% (Oracle 19c)
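Unless marked as a repost, all articles on this site are original. Reposted from: love wife love life, Roger's Oracle/MySQL/PostgreSQL data recovery blog
A customer environment recently showed abnormal behavior: Linux sys% CPU consumption was too high, reaching 30%+ at peak times and even exceeding usr%.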
09:52:31:130 [root@dbxxxx12 ~]# dstat -cldsnmy
09:52:31:327 ----total-cpu-usage---- ---load-avg--- -dsk/total- ----swap--- -net/total- ------memory-usage----- ---system--
09:52:31:328 usr sys idl wai hiq siq| 1m 5m 15m | read writ| used free| recv send| used buff cach free| int csw
09:52:32:331 4 3 93 0 0 0| 122 121 114| 92M 52M| 57M 16G| 0 0 | 668G 865M 196G 142G| 103k 126k
09:52:33:332 42 22 35 0 0 1| 122 121 114| 307M 39M| 57M 16G| 176M 237M| 668G 865M 196G 143G| 510k 211k
09:52:34:327 42 25 32 0 0 1| 129 122 114| 269M 39M| 57M 16G| 186M 230M| 666G 865M 196G 144G| 511k 205k
09:52:35:331 44 23 32 0 0 1| 129 122 114| 218M 73M| 57M 16G| 198M 231M| 666G 865M 196G 144G| 536k 226k
09:52:36:011 41 19 39 0 0 1| 129 122 114| 243M 74M| 57M 16G| 191M 245M| 666G 865M 196G 144G| 513k 223k
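Even before the morning peak arrives, sys already reaches 25%, which is abnormal. During the follow-up analysis, repeated top captures showed a large number of scmn processes consuming excessive CPU: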
10:38:46:790 top - 10:38:46 up 106 days, 16:30, 14 users, load average: 135.09, 132.80, 118.06
10:38:46:790 Tasks: 9641 total, 96 running, 9541 sleeping, 2 stopped, 2 zombie
10:38:46:791 %Cpu(s): 18.4 us, 33.2 sy, 0.0 ni, 47.1 id, 0.4 wa, 0.0 hi, 0.8 si, 0.0 st
10:38:46:791 KiB Mem : 10561102+total, 15898280+free, 68324761+used, 21387985+buff/cache
10:38:46:791 KiB Swap: 16777212 total, 16718588 free, 58624 used. 34448185+avail Mem
10:38:46:791
10:38:46:792 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10:38:46:792 63161 grid 20 0 13.9g 232460 25268 S 244.1 0.0 2235:19 java
10:38:46:793 204947 oracle 20 0 511.8g 146492 85788 S 104.6 0.0 795:44.23 ora_scmn_scsbgj
10:38:46:793 42759 oracle 20 0 514.5g 69600 49480 R 100.0 0.0 18:27.64 ora_p012_scsbgj
10:38:46:794 52139 oracle 20 0 514.5g 45440 33828 R 100.0 0.0 9:57.99 oracle_52139_sc
10:38:46:794 50493 oracle 20 0 514.5g 66764 50092 R 99.7 0.0 25:01.16 oracle_50493_sc
10:38:46:794 66749 oracle 20 0 514.6g 1.1g 45736 R 99.7 0.1 56:52.12 oracle_66749_sc
10:38:46:795 138507 oracle 20 0 514.5g 65384 46368 R 99.7 0.0 6:11.54 oracle_138507_s
10:38:46:795 168126 oracle 20 0 514.5g 45352 33796 R 99.7 0.0 4:45.57 oracle_168126_s
10:38:46:795 42757 oracle 20 0 514.6g 72864 50128 R 99.4 0.0 15:36.45 ora_p011_scsbgj
10:38:46:796 110597 oracle 20 0 515.6g 1.1g 45884 R 99.4 0.1 44:27.95 oracle_110597_s
10:38:46:796 200607 oracle 20 0 511.5g 59424 44772 R 99.4 0.0 87:29.93 oracle_200607_s
10:38:46:796 213135 oracle 20 0 508.6g 82416 58100 R 99.4 0.0 161:10.08 ora_p00a_scsbgj
10:38:46:821 213139 oracle 20 0 504.6g 82072 56868 R 99.4 0.0 171:51.14 ora_p00c_scsbgj
10:38:46:822 133223 oracle 20 0 514.5g 53288 39532 R 98.8 0.0 0:29.17 oracle_133223_s
10:38:46:822 42755 oracle 20 0 514.6g 74040 49984 R 98.1 0.0 15:12.97 ora_p010_scsbgj
10:38:46:822 104726 oracle 20 0 514.5g 101796 47080 R 94.8 0.0 2:12.13 oracle_104726_s
10:38:46:822 14575 oracle 20 0 514.5g 67668 48688 R 94.1 0.0 0:55.08 oracle_14575_sc
10:38:46:822 204884 oracle 20 0 511.8g 145732 85564 S 91.0 0.0 798:03.08 ora_scmn_scsbgj
10:38:46:822 204841 oracle 20 0 511.8g 145568 85424 S 88.9 0.0 789:03.39 ora_scmn_scsbgj
10:38:46:823 149469 oracle 20 0 514.5g 62060 44872 R 88.6 0.0 38:57.63 oracle_149469_s
10:38:46:823 204853 oracle 20 0 511.8g 146232 85672 S 88.3 0.0 832:18.07 ora_scmn_scsbgj
10:38:46:823 204890 oracle 20 0 511.8g 146084 85848 S 88.0 0.0 799:14.89 ora_scmn_scsbgj
10:38:46:823 91621 oracle 20 0 514.5g 45576 33972 R 87.7 0.0 7:40.24 oracle_91621_sc
10:38:46:823 204803 oracle 20 0 511.8g 145720 85660 S 87.0 0.0 802:39.04 ora_scmn_scsbgj
10:38:46:823 204799 oracle 20 0 511.8g 145916 85872 S 86.7 0.0 796:50.94 ora_scmn_scsbgj
10:38:46:824 204823 oracle 20 0 511.8g 145956 85900 S 86.7 0.0 802:19.08 ora_scmn_scsbgj
10:38:46:824 204933 oracle 20 0 511.8g 1.1g 1.1g S 86.4 0.1 807:57.54 ora_scmn_scsbgj
10:38:46:824 204914 oracle 20 0 510.8g 146160 85684 S 85.8 0.0 793:21.89 ora_scmn_scsbgj
10:38:46:824 204817 oracle 20 0 511.8g 146728 86156 S 85.2 0.0 797:27.14 ora_scmn_scsbgj
10:38:46:824 204905 oracle 20 0 511.8g 146144 85848 S 84.9 0.0 795:12.41 ora_scmn_scsbgj
10:38:46:824 204968 oracle 20 0 511.8g 1.1g 1.1g S 84.9 0.1 810:06.94 ora_scmn_scsbgj
10:38:46:825 157837 oracle 20 0 514.5g 54372 40712 R 83.6 0.0 2:37.20 oracle_157837_s
10:38:46:825 204797 oracle 20 0 511.8g 145416 85696 S 83.0 0.0 792:29.84 ora_scmn_scsbgj
10:38:46:825 204850 oracle 20 0 511.8g 146036 85804 S 83.0 0.0 796:05.20 ora_scmn_scsbgj
10:38:46:825 204801 oracle 20 0 511.8g 146068 85764 S 82.7 0.0 796:12.43 ora_scmn_scsbgj
10:38:46:825 204861 oracle 20 0 511.8g 145956 85652 S 82.7 0.0 794:52.26 ora_scmn_scsbgj
10:38:46:825 204897 oracle 20 0 511.8g 146024 85976 S 82.7 0.0 821:30.15 ora_scmn_scsbgj
10:38:46:825 204828 oracle 20 0 511.8g 146712 86408 S 82.1 0.0 802:29.78 ora_scmn_scsbgj
10:38:46:826 204846 oracle 20 0 511.8g 146112 86128 S 81.5 0.0 796:14.10 ora_scmn_scsbgj
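To see at a glance how much of the CPU goes to the scmn processes, rows from a top capture like the one above can be grouped by command name. The following is only a rough sketch: the function name `summarize_top` is made up for illustration, the sample rows are a small subset of the capture with the timestamp prefix removed, and the default top column layout is assumed.

```python
from collections import defaultdict

def summarize_top(rows):
    """Group top process rows by COMMAND and sum their %CPU.

    Assumes the default top layout:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    totals = defaultdict(lambda: [0, 0.0])  # command -> [process count, summed %CPU]
    for row in rows:
        fields = row.split()
        if len(fields) < 12 or not fields[0].isdigit():
            continue  # skip the header row, blank lines, and anything malformed
        command, cpu = fields[11], float(fields[8])
        totals[command][0] += 1
        totals[command][1] += cpu
    return {cmd: (count, round(cpu, 1)) for cmd, (count, cpu) in totals.items()}

# A few rows taken from the capture, timestamp prefix stripped (illustrative subset).
sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "63161 grid 20 0 13.9g 232460 25268 S 244.1 0.0 2235:19 java",
    "204947 oracle 20 0 511.8g 146492 85788 S 104.6 0.0 795:44.23 ora_scmn_scsbgj",
    "42759 oracle 20 0 514.5g 69600 49480 R 100.0 0.0 18:27.64 ora_p012_scsbgj",
    "204884 oracle 20 0 511.8g 145732 85564 S 91.0 0.0 798:03.08 ora_scmn_scsbgj",
]

summary = summarize_top(sample)
# Print commands ordered by total CPU, highest first.
for cmd, (count, cpu) in sorted(summary.items(), key=lambda kv: -kv[1][1]):
    print(f"{cmd:20s} procs={count:3d} cpu={cpu:7.1f}%")
```

Fed with the full capture, the same grouping makes the many ora_scmn_scsbgj entries stand out as a single line.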
perf top shows where the time is being spent:
Overhead  Shared Object  Symbol
  25.59%  [kernel]       [k] native_queued_spin_lock_slowpath
   9.91%  oracle         [.] kcbgtcr
   3.81%  oracle         [.] kaf4reasrp1km
   2.85%  oracle         [.] kaf4reasrp0km
   2.82%  oracle         [.] kdstf110010100000000km
   2.42%  oracle         [.] kcbrls
   1.62%  oracle         [.] qetlbr
   1.16%  [kernel]       [k] _raw_spin_unlock_irqrestore
   1.13%  oracle         [.] kdstf010010100001000km
   1.06%  oracle         [.] kafger
   1.06%  oracle         [.] lnxcpn
   1.06%  oracle         [.] lxkLikeUTF8
   0.93%  oracle         [.] evaopn2
   0.83%  oracle         [.] ktrgcm
   0.80%  oracle         [.] ktrvac
   0.76%  oracle         [.] kjbrfnd
   0.70%  oracle         [.] kcbzar
   0.70%  oracle         [.] qertbFetchByRowID
   0.63%  oracle         [.] __intel_avx_rep_memset
   0.63%  oracle         [.] kcbz_fr_buf
   0.63%  oracle         [.] kdifxs0
   0.63%  oracle         [.] kdstf010010100000000km
   0.63%  oracle         [.] lxsCnvCaseUTF8
   0.60%  [kernel]       [k] __do_softirq
   0.60%  [kernel]       [k] i40e_get_tx_pending
   0.53%  [kernel]       [k] __nf_conntrack_find_get
   0.53%  oracle         [.] evareo
   0.50%  oracle         [.] kcbz_fr_buf
   0.51%  [kernel]       [k] finish_task_switch
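The top entry, native_queued_spin_lock_slowpath, is the kernel's contended spinlock path, which is consistent with the high sys% seen earlier. To compare kernel time against Oracle time in a profile like this, the overhead column can be summed per shared object. This is a small sketch under assumptions: `overhead_by_object` is a hypothetical helper, and the sample holds only a handful of lines from the output above.

```python
def overhead_by_object(lines):
    """Sum the perf 'Overhead' percentage per shared object."""
    totals = {}
    for line in lines:
        parts = line.split()
        # Expected shape: "<pct>% <shared object> <[k] or [.]> <symbol>"
        if len(parts) < 4 or not parts[0].endswith("%"):
            continue  # skip the header and anything that is not a sample row
        pct = float(parts[0].rstrip("%"))
        totals[parts[1]] = totals.get(parts[1], 0.0) + pct
    return {obj: round(pct, 2) for obj, pct in totals.items()}

# Illustrative subset of the perf top output.
sample = [
    "Overhead Shared Object Symbol",
    "25.59% [kernel] [k] native_queued_spin_lock_slowpath",
    "9.91% oracle [.] kcbgtcr",
    "3.81% oracle [.] kaf4reasrp1km",
    "1.16% [kernel] [k] _raw_spin_unlock_irqrestore",
]

by_object = overhead_by_object(sample)
for obj, pct in sorted(by_object.items(), key=lambda kv: -kv[1]):
    print(f"{obj:10s} {pct:6.2f}%")
```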
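The scmn process itself arrived with the multi-threaded process architecture (Multi-Threaded architecture of processes) introduced in Oracle 12c. The feature is still disabled by default in 19c; it can be enabled by setting the following parameter to true: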
threaded_execution = true
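For new features of this kind, I personally recommend not using them for now. Oracle still ships with this one disabled by default, which suggests it is not yet stable.
Finally, the description in High SYS CPU Usage ON LMS Thread (SCMN/CR00/RS01) During High Workload (Doc ID 2707048.1), taken together with the perf top output we captured, is essentially a match for our case.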
In the end we adjusted the semaphore settings, changing
kernel.sem =12000 1536000 12000 128
to:
kernel.sem =1024 66666 1024 256
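The four values of kernel.sem are, in order, SEMMSL (maximum semaphores per set), SEMMNS (maximum semaphores system-wide), SEMOPM (maximum operations per semop call) and SEMMNI (maximum number of semaphore sets). The sketch below simply parses the two settings and prints each limit side by side; `SemLimits` and `parse_kernel_sem` are names invented for this illustration.

```python
from typing import NamedTuple

class SemLimits(NamedTuple):
    semmsl: int  # max semaphores per semaphore set
    semmns: int  # max semaphores system-wide
    semopm: int  # max operations per semop() call
    semmni: int  # max number of semaphore sets

def parse_kernel_sem(setting: str) -> SemLimits:
    """Parse a sysctl line such as 'kernel.sem =1024 66666 1024 256'."""
    _, _, values = setting.partition("=")
    return SemLimits(*(int(v) for v in values.split()))

before = parse_kernel_sem("kernel.sem =12000 1536000 12000 128")
after = parse_kernel_sem("kernel.sem =1024 66666 1024 256")

# Show how each limit changed.
for name in SemLimits._fields:
    print(f"{name.upper():7s} {getattr(before, name):>8d} -> {getattr(after, name):>8d}")
```

On a Linux host the values currently in effect can be read from /proc/sys/kernel/sem, which uses the same four-field order.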
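So far the problem appears to exist only in 18c and later. At least, we have not seen it in our customers' existing 12.2 environments, which are also under heavy load and have a very high processes setting (over 5000 on each node).
Summary:
1. On 18c and later, and especially on 19c, which most people are running now, pay attention to the semaphore settings. Bigger is not better; enough is enough. Otherwise you may hit the high sys% problem.
2. Although 19c does not enable the multi-threaded process feature by default, some processes still use multiple threads, and my guess is that this is what triggers the problem. It is 99% likely to be a bug.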