5月21日 奇怪的性能问题(上)
前一段时间我在广州参加一个研讨会的时候认识了刘工,他给我讲了一个困扰了他们很久的问题。刘工负责维护的一套RAC系统存在一个很大的问题,其中一个节点启动后是正常的,不过运行一段时间后,性能就会变慢,随着系统启动的时间越长,性能越差,最后只能重启数据库,而另外一个节点就很正常。
这个问题在1年前就出现了,只是刚开始的时候每隔2、3个月出现一次,而最近1号节点变慢的周期从2、3个月缩短到10来天了,每隔10多天重启一次数据库,对于一个7*24的系统来说,是十分讨厌的,因此他们想尽快解决这个问题。不过他们已经请了原厂在内的好几批人,都没有任何进展。这种情况我还真没碰到过,所以我和刘工约好,他去申请一点经费,我帮他们分析一下,看看能不能解决。
今天一大早我就接到了刘工的电话,他告诉我经费已经申请下来了,我们随时可以开始工作。我昨天在广州做一个项目,和可以吃完晚饭后已经10点多了,于是我就没有回深圳,在东山口的东山宾馆住了一晚上,早上刚刚准备退房就接到了刘工的电话,于是我和他约定马上过去,今天就开始分析他们的问题。
这个问题确实比较蹊跷,我看了5月10号和20号两天的STATSPACK报告,10号是性能比较正常的,20号1号节点就出现了很严重的性能问题。从STATSPACK报告的LOAD PROFILE上来看,两天的系统负载差别并不大:
5月10日的数据:
Cache Sizes (end)
~~~~~~~~~~~~~~~~~
Buffer Cache: 6,288M Std Block Size: 8K
Shared Pool Size: 4,224M Log Buffer: 2,048K
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 521,205.84 17,343.34
Logical reads: 163,867.72 5,452.77
Block changes: 2,984.39 99.31
Physical reads: 787.93 26.22
Physical writes: 217.34 7.23
User calls: 9,668.24 321.71
Parses: 3,061.57 101.87
Hard parses: 0.13 0.00
Sorts: 1,109.17 36.91
Logons: 0.99 0.03
Executes: 3,099.00 103.12
Transactions: 30.05
% Blocks changed per Read: 1.82 Recursive Call %: 11.33
Rollback per transaction %: 18.67 Rows per Sort: 94.55
5月20日的数据:
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 611,013.35 18,803.15
Logical reads: 180,201.58 5,545.47
Block changes: 3,679.53 113.23
Physical reads: 2,277.02 70.07
Physical writes: 237.57 7.31
User calls: 14,797.10 455.36
Parses: 4,531.44 139.45
Hard parses: 0.20 0.01
Sorts: 1,560.20 48.01
Logons: 0.68 0.02
Executes: 4,586.51 141.14
Transactions: 32.50
% Blocks changed per Read: 2.04 Recursive Call %: 11.07
Rollback per transaction %: 6.68 Rows per Sort: 112.28
5月20日的系统负载比10号大了大约15%,不过系统的资源十分充足,今天我观察到的CPU使用率仅仅为50%左右,而从LOAD PROFILE上看,今天的系统负载比较高,比20号还要高10%左右。看样子基本上可以排除由于资源瓶颈导致的性能问题。从各个缓冲池的命中率指标上看,两天的差别也不大:
5月10日:
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 99.99 Redo NoWait %: 100.00
Buffer Hit %: 98.96 In-memory Sort %: 100.00
Library Hit %: 99.74 Soft Parse %: 99.99
Execute to Parse %: 5.30 Latch Hit %: 99.90
Parse CPU to Parse Elapsd %: 97.20 % Non-Parse CPU: 88.35
5月20日:
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 99.85 Redo NoWait %: 100.00
Buffer Hit %: 98.75 In-memory Sort %: 100.00
Library Hit %: 100.00 Soft Parse %: 100.00
Execute to Parse %: 1.20 Latch Hit %: 99.73
Parse CPU to Parse Elapsd %: 88.61 % Non-Parse CPU: 89.52
不过从平均事务响应时间上看,就有很大的差异,5月10日的平均事务响应时间是248毫秒,而昨天的平均事务响应时间是2416毫秒,几乎相差了10倍。从TOP EVENTS来看:
5月10日:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file sequential read 7,791,754 71,642 71.34
CPU time 21,417 21.33
global cache cr request 5,005,630 2,346 2.34
global cache null to x 133,460 1,068 1.06
db file scattered read 64,424 515 .51
5月20日:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
global cache cr request 3,893,750 124,379 44.39
db file sequential read 5,419,977 60,672 21.65
CPU time 35,783 12.77
buffer busy global CR 500,356 15,046 5.37
global cache open x 224,560 8,301 2.96
-------------------------------------------------------------
很明显,5月20日的TOP 5 EVENTS中,global cache cr request占了44.39%,从这个现象上看,应该是RAC出现了性能问题。再往下看,确实发现大量的GCS性能指标存在严重的问题:
5月10日:
Global Cache Service - Workload Characteristics
-----------------------------------------------
Ave global cache get time (ms): 0.3
Ave global cache convert time (ms): 3.3
Ave build time for CR block (ms): 0.1
Ave flush time for CR block (ms): 0.2
Ave send time for CR block (ms): 0.1
Ave time to process CR block request (ms): 0.4
Ave receive time for CR block (ms): 1.0
Ave pin time for current block (ms): 3.6
Ave flush time for current block (ms): 0.1
Ave send time for current block (ms): 0.1
Ave time to process current block request (ms): 3.8
Ave receive time for current block (ms): 2.4
Global cache hit ratio: 1.3
Ratio of current block defers: 0.1
% of messages sent for buffer gets: 1.3
% of remote buffer gets: 0.2
Ratio of I/O for coherence: 1.0
Ratio of local vs remote work: 5.6
Ratio of fusion vs physical writes: 0.0
Global Enqueue Service Statistics
---------------------------------
Ave global lock get time (ms): 0.3
Ave global lock convert time (ms): 0.0
Ratio of global lock gets vs global lock releases: 1.0
GCS and GES Messaging statistics
--------------------------------
Ave message sent queue time (ms): 0.1
Ave message sent queue time on ksxp (ms): 0.7
Ave message received queue time (ms): 0.0
Ave GCS message process time (ms): 0.1
Ave GES message process time (ms): 0.0
% of direct sent messages: 51.7
% of indirect sent messages: 43.2
% of flow controlled messages: 5.1
5月20日:
Global Cache Service - Workload Characteristics
-----------------------------------------------
Ave global cache get time (ms): 16.2
Ave global cache convert time (ms): 35.3
Ave build time for CR block (ms): 0.1
Ave flush time for CR block (ms): 1.0
Ave send time for CR block (ms): 0.1
Ave time to process CR block request (ms): 1.2
Ave receive time for CR block (ms): 33.1
Ave pin time for current block (ms): 2.3
Ave flush time for current block (ms): 0.1
Ave send time for current block (ms): 0.1
Ave time to process current block request (ms): 2.5
Ave receive time for current block (ms): 37.0
Global cache hit ratio: 1.5
Ratio of current block defers: 0.1
% of messages sent for buffer gets: 1.5
3Ratio of I/O for coherence: 1.0
Ratio of local vs remote work: 8.3
Ratio of fusion vs physical writes: 0.1
Global Enqueue Service Statistics
---------------------------------
Ave global lock get time (ms): 1.1
Ave global lock convert time (ms): 0.0
Ratio of global lock gets vs global lock releases: 1.0
GCS and GES Messaging statistics
--------------------------------
Ave message sent queue time (ms): 0.4
Ave message sent queue time on ksxp (ms): 32.8
Ave message received queue time (ms): 0.2
Ave GCS message process time (ms): 0.1
Ave GES message process time (ms): 0.1
% of direct sent messages: 40.5
% of indirect sent messages: 45.2
% of flow controlled messages: 14.3
Ave message sent queue time on ksxp居然达到了32.8毫秒,在一个正常的系统中,这个指标应该小于1毫秒,难道是CLUSTER INTERCONNECT出现了问题?于是马上做了一个ipc的诊断:
SQL>oradebug setmypid
SQL>oradebug ipc
从IPC的DUMP上来看:
SSKGXPT 0x1043be40 flags SSKGXPT_READPENDING socket no 7 IP 172.16.30.11 UDP 39110
RAC INTERCONNECT的配置没有问题,并没有使用错误的网络接口。这个网卡是千兆网卡,两个节点之间是通过直连网线直连的,并没有经过交换机。看样子网络的配置并没有问题,难道是TCP和UDP的BUFFER不足,导致了性能问题吗?
[root]#netstat -s |grep overflow
0 socket buffer overflows
0 ipintrq overflows
0 ipintrq overflows
[root@ossrac1 /]#netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en10 1500 link#2 0.1a.64.a8.d6.36 857956673 0 2207490572 4 0
en10 1500 10.0.30 ossrac1 857956673 0 2207490572 4 0
en10 1500 10.0.30 ossrac1_vip 857956673 0 2207490572 4 0
en11 1500 link#3 0.1a.64.a8.d6.37 3840560330 0 3490362343 3 0
en11 1500 172.16.30 ossrac1_rachb 3840560330 0 3490362343 3 0
en12 1500 link#4 0.1a.64.a8.d5.80 43025069 0 29417227 3 0
en12 1500 172.17.30 ossrac1_hahb 43025069 0 29417227 3 0
en13 1500 link#5 0.1a.64.a8.d5.81 23418517 0 10146725 4 0
en13 1500 10.0.31 ossrac1 23418517 0 10146725 4 0
lo0 16896 link#1 65232739 0 65274627 0 0
lo0 16896 127 loopback 65232739 0 65274627 0 0
lo0 16896 ::1 65232739 0 65274627 0 0
看样子UDP的OVERFLOW为0,相关网络接口的IERRS和OERRS都是0,网络的软硬件配置应该没有问题,这个问题不是由于INTERCONNECT网络方面的配置和性能导致的。网络RAC INTERCONNECT的流量太大,导致了性能问题吗?
通过topas命令我检查了一下,发现有几个lms进程经常出现在top进程里,通过ps aux|grep lms命令检查,发现这些lms进程的rss值都比较高,基本都在800M以上,甚至有一个lms进程的rss超过了2G。难道真的存在某个热表导致了GCS出现性能问题?看样子需要通过v$segment_statistics来检查检查是不是有某些表产生了过多的GLOBAL CR REQUEST了。
col stat format a40 truncate;
col owner format a10 truncate;
select STATISTIC_NAME stat,OWNER,OBJECT_NAME obj,sum(value) Val
from v$segment_statistics
where STATISTIC_NAME like 'global%'
and value > 10000
group by STATISTIC_NAME,OWNER,OBJECT_NAME
order by val desc;
我首先把结果写入一个EXECL文件,10分钟后再次采集相同的数据,两个数据进行比对,从这个间隔10分钟的数据比对来看,并不存在十分热的对象。最多的一张表在10分钟内也只产生了一千多次gcs访问。为了进一步确认GLOBAL CR REQUEST的数量,通过v$cr_block_server视图进行分析:
SQL>SET TIME ON
11:48:40 SQL> select CR_REQUESTS cr, CURRENT_REQUESTS cur,
11:48:53 2 DATA_REQUESTS data, UNDO_REQUESTS undo,
11:48:53 3 TX_REQUESTS tx from v$cr_block_server;
CR CUR DATA UNDO TX
----------- ---------- ---------- ---------- ----------
7266322440 12747555 726652166 5348848 5969191
11:48:53 SQL> /
CR CUR DATA UNDO TX
----------- ---------- ---------- ---------- ----------
7266943300 12747754 726694252 5348848 5969357
11:58:29 SQL>
在接近10分钟的时间里,整个系统仅仅产生了60多万个CR REQUESTS,确实不算很大。为了确认互联网络不存在性能问题,我做了一个简单的测试,我找到了一个100G的dmp文件,通过sftp测试一下通过互联网络传输的性能,在sftp拷贝开始之后,我做了一个statspack的snap采样,20多分钟后文件传输完毕,显示的平均速度是77M/秒,看样子网络的性能是没有问题的,于是我马上又做了一个STATSPACK的SNAP采样,生成了一份这20多分钟的STATSPACK报告,从报告中我们发现GCS的指标和没有做sftp大文件传输时差不多,并没有因为网络带宽被文件传输占用而导致GCS的性能下降。
这就出现了十分令人感到费解的现象了,一方面GLOBAL MESSAGE的传输性能很差,全局消息的延时十分严重,GCS的性能指标也十分糟糕,另外一方面却找不到热点的对象,而且INTERCONNECT网络也没有任何问题。这到底是怎么回事呢?
我正在冥思苦想,刘工开完会从外面回到了办公室,他见面就问:“老白,有进展吗?”我告诉他还没什么进展,这个系统确实十分怪异,看样子问题确实棘手。
刘工说:“老白,先吃饭去吧,我也没指望你一来就能搞定这个问题,这套系统前前后后来过不下10批人马了,也有几批原厂的专家,都没看出什么问题来。”