最近这段时间比较忙,很少写案例了。今天补一个。
前些天处理了一个案例,用户的系统出现了短暂HANG住的现象,一般情况2、3分钟,然后自动好了,有时候HANG住时间接近10分钟,这种故障几乎每隔1、2天就会出现。这个系统是用户最为核心的客服系统,因此需要尽快解决问题。从STATSPACK报告上看:
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 331,508.40 2,565.77
Logical reads: 485,862.63 3,760.43
Block changes: 3,251.90 25.17
Physical reads: 7,697.25 59.57
Physical writes: 1,751.36 13.56
User calls: 10,025.88 77.60
Parses: 2,779.00 21.51
Hard parses: 19.28 0.15
Sorts: 3,565.99 27.60
Logons: 12.42 0.10
Executes: 5,479.72 42.41
Transactions: 129.20
% Blocks changed per Read: 0.67 Recursive Call %: 40.47
Rollback per transaction %: 0.82 Rows per Sort: 14.65
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 99.98 Redo NoWait %: 100.00
Buffer Hit %: 99.20 In-memory Sort %: 100.00
Library Hit %: 99.83 Soft Parse %: 99.31
Execute to Parse %: 49.29 Latch Hit %: 99.57
Parse CPU to Parse Elapsd %: 0.40 % Non-Parse CPU: 95.57
Shared Pool Statistics Begin End
------ ------
Memory Usage %: 84.55 80.57
% SQL with executions>1: 80.78 82.96
% Memory for SQL w/exec>1: 85.53 88.53
从各个缓冲区的命中率来说,似乎都很正常,系统负载也不算高,每秒48万多的逻辑读,每秒120多个事务,对于一个有80个核的系统来说,也不会有多大的问题。不过对于99.83的library hit, % Non-Parse CPU: 95.57这个指标似乎偏低了一些。不过每秒才十几个应分析,似乎共享池也不应该有很严重的问题。继续看下面的TOP EVENTS
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free 10,332,271 1,459,056 87.07
library cache pin 54,447 157,185 9.38
CPU time 35,596 2.12
db file sequential read 2,646,320 9,860 .59
db file scattered read 1,205,552 3,495 .21
-------------------------------------------------------------
LATCH FREE居然是等待事件中最高的,看看每次等待的时间:
Avg
Total Wait wait Waits
Event Waits Timeouts Time (s) (ms) /txn
---------------------------- ------------ ---------- ---------- ------ --------
latch free 10,332,271 22,218 1,459,056 141 22.2
library cache pin 54,447 53,101 157,185 2887 0.1
每次Latch free等待的时间为141毫秒,后面紧跟着的Library cache pin等待居然达到了2887毫秒。看样子很可能是library cache出现了问题。我们马上查看一下Latch相关的内容:
Pct Avg Wait Pct
Get Get Slps Time NoWait NoWait
Latch Requests Miss /Miss (s) Requests Miss
------------------------ -------------- ------ ------ ------ ------------ ------
library cache 97,596,609 7.0 1.2 ###### 32,487 ######
library cache load lock 3,308 0.2 0.0 0 0
library cache pin 73,586,015 0.2 0.0 2 0
library cache pin alloca 30,296,746 0.2 0.0 2 0
row cache enqueue latch 18,005,390 0.7 0.0 0 0
row cache objects 19,646,309 0.3 0.0 1 1,488 0.0
shared pool 19,082,429 13.4 0.8 83091 0
我们看到Library cache和shared pool闩锁等待十分严重。基本上可以确定故障原因肯定和library cache有关。
library cache kglpndl: child: before pro 0 1,457,471 545,317
library cache kglupc: child 0 1,098,140 977,044
library cache kglpnc: child 0 618,362 ########
Latch Miss Sources for DB: SCCRM1 Instance: sccrm1 Snaps: 2381 -2382
-> only latches with sleeps are shown
-> ordered by name, sleeps desc
NoWait Waiter
Latch Name Where Misses Sleeps Sleeps
------------------------ -------------------------- ------- ---------- --------
library cache kglpin: child: heap proces 0 555,259 963,353
library cache kgllkdl: child: cleanup 0 358,150 329,007
library cache kgldti: 2child 0 338,328 28,388
library cache kglpin 0 329,515 45,290
library cache kglhdgn: child: 0 294,123 832,251
library cache kgldte: child 0 0 205,193 519,186
library cache kglobpn: child: 0 203,734 322,489
LIBRARY CACHE闩锁的等待主要集中在pin和unpin上。从共享池的情况来看:
Get Pct Pin Pct Invali-
Namespace Requests Miss Requests Miss Reloads dations
--------------- ------------ ------ -------------- ------ ---------- --------
BODY 209,560 0.0 209,563 0.0 0 0
CLUSTER 305 0.3 334 0.6 0 0
INDEX 241 2.5 235 2.6 0 0
JAVA DATA 10,331 0.0 76,616 0.0 0 0
SQL AREA 397,651 0.4 32,046,553 0.2 5,217 67
TABLE/PROCEDURE 5,410,816 0.0 6,849,315 0.0 829 0
TRIGGER 93,044 0.0 93,045 0.0 2 0
-------------------------------------------------------------
似乎丢失率也很正常,为什么会出现Library cache闩锁的争用呢?再来看看共享池的情况:
Pool Name Begin value End value % Diff
------ ------------------------------ ---------------- ---------------- -------
java free memory 208,429,056 208,953,344 0.25
java memory in use 9,674,752 9,150,464 -5.42
large free memory 268,435,456 268,435,456 0.00
shared KGK heap 131,696 131,696 0.00
shared KGLS heap 8,737,224 7,587,768 -13.16
shared KQR L PO 8,046,320 8,139,552 1.16
shared KQR L SO 1,313,696 1,313,696 0.00
shared KQR M PO 19,439,080 19,828,744 2.00
shared KQR M SO 1,732,520 1,732,520 0.00
shared KQR S SO 33,712 33,712 0.00
shared KQR X PO 4,423,232 4,423,232 0.00
shared MTTR advisory 7,960,792 7,960,792 0.00
shared OUTLNCACHE 18,480 18,480 0.00
shared PL/SQL DIANA 2,604,624 2,497,632 -4.11
shared PL/SQL MPCODE 8,041,344 7,833,064 -2.59
shared PL/SQL PPCODE 682,512 695,320 1.88
shared PL/SQL SOURCE 6,704 12,888 92.24
shared PLS non-lib hp 2,088 2,088 0.00
shared PX subheap 399,440 399,440 0.00
shared dictionary cache 4,274,432 4,274,432 0.00
shared errors 1,078,008 850,816 -21.08
shared fixed allocation callback 5,000 5,000 0.00
shared free memory 927,847,960 1,167,309,024 25.81
shared joxlod: in ehe 481,920 477,000 -1.02
shared joxlod: in phe 66,376 45,624 -31.26
shared joxs heap init 4,240 4,240 0.00
shared library cache 1,628,937,584 1,524,783,640 -6.39
shared miscellaneous 1,520,681,824 1,520,676,800 -0.00
shared parameters 146,616 136,216 -7.09
shared partitioning d 125,840 125,840 0.00
shared session param values 17,647,864 6,013,168 -65.93
shared sim memory hea 58,014,776 58,014,776 0.00
shared sql area 1,783,248,464 1,660,844,208 -6.86
shared table definiti 17,560 11,456 -34.76
shared temporary tabl 56,696 38,464 -32.16
shared trigger defini 3,760 3,464 -7.87
shared trigger inform 4,056 4,896 20.71
shared trigger source 2,376 2,056 -13.47
buffer_cache 58,720,256,000 58,720,256,000 0.00
fixed_sga 833,960 833,960 0.00
log_buffer 63,177,728 63,177,728 0.00
-------------------------------------------------------------
好像也没有多大的问题,空闲的空间有900M以上。至此我们已经排除了最常见的共享池问题的可能性。那么剩下来唯一的可能性就是Library cache闩锁本身的访问过于频繁,导致闩锁数量不足,从而出现问题。于是我检查了一下library cache闩锁的数量,发现是53个,Library cahce闩锁的数量一般来说是高于CPU_COUNT数量的最小的素数,在一个80核的系统里,怎么才有53个子闩锁呢?于是我马上查看了一下cpu_count参数,发现该参数的值居然是48。在9i的系统中,某些平台下面,CPU_COUNT参数是在数据库安装时设置的,扩容CPU后不会自动变更。于是我提出了优化方案:
1、修改CPU_COUNT=80
2、将共享池大小从4G扩充到5G
其实第二项修改是一个老DBA才能做出的,一般DBA往往会根据当前的一些表象来确定修改方案。Memory Usage %: 84.55 80.57,从这个值上面可以看出实际共享池的使用率并不高,很难做出扩大共享池的决定。而作为一个老DBA,我考虑到LIBRARY CACHE闩锁增加后,共享池的并发操作会大幅度增加,可能引起共享池使用率的提升,在本系统CPU使用率不足50%并且内存还有20多G空闲的前提条件下,加大共享池的风险很小,而且可以避免出现衍生问题。在这种只有一次调整机会的核心系统中,采用一个补充方案可以避免很多不必要的麻烦。
这次调整效果十分明显,调整后系统故障消失了,而且平均事务响应时间居然下降了35%以上:
Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
--------------- ---------------
Redo size: 1,295,745.45 1,910.93
Logical reads: 597,059.42 880.53
Block changes: 11,042.94 16.29
Physical reads: 15,512.65 22.88
Physical writes: 2,896.21 4.27
User calls: 33,470.84 49.36
Parses: 8,367.61 12.34
Hard parses: 10.64 0.02
Sorts: 21,393.72 31.55
Logons: 23.10 0.03
Executes: 18,865.66 27.82
Transactions: 678.07
% Blocks changed per Read: 1.85 Recursive Call %: 41.97
Rollback per transaction %: 0.28 Rows per Sort: 5.76
Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Buffer Nowait %: 99.95 Redo NoWait %: 100.00
Buffer Hit %: 98.11 In-memory Sort %: 100.00
Library Hit %: 99.97 Soft Parse %: 99.87
Execute to Parse %: 55.65 Latch Hit %: 99.59
Parse CPU to Parse Elapsd %: 11.01 % Non-Parse CPU: 99.63
Shared Pool Statistics Begin End
------ ------
Memory Usage %: 88.14 95.92
% SQL with executions>1: 77.07 73.18
% Memory for SQL w/exec>1: 76.03 74.59
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~ % Total
Event Waits Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
CPU time 48,883 50.97
db file sequential read 5,584,789 22,400 23.36
db file scattered read 2,543,891 7,200 7.51
log file sync 2,515,520 6,077 6.34
buffer busy waits 1,040,603 3,193 3.33
-------------------------------------------------------------
调整后,每秒钟PARSE CALL的数量从2000多增加到8000多,但是LIBRARY CAHCE的问题消失。
这个案例里面蕴含的一些技巧我想也不需要我总结了,在上面的字里行间,大家细细去体会吧。实际上这个案例用到的技术都是最基础的,并没有什么高深的内容,但是如果碰到这样的案例,谁能够很快做出正确的判断呢?