今天碰到一个案例,虚惊了一场。一个客户说今天换了块板子,重启数据库的时候数据库启动后很快就宕了,ALERT LOG信息如下:Errors in file /oracle/app1/oracle/product/9.2/admin/xxxlyg/udump/xxxlyg_ora_4964.trc:ORA-07445: 出现异常: 核心转储 [kgllkdl()+3072] [SIGSEGV] [Address not mapped to object] [0x290000000000008] [] []Wed Mar 10 21:52:12 2010Errors in file /oracle/app1/oracle/product/9.2/admin/xxx/udump/xxxlyg_ora_4894.trc:ORA-00600: 内部错误代码,参数: [26599], [1], [229], [], [], [], [], []ORA-29549: 类SYS.oracle/jdbc/dbacc我问客户做了什么改动,客户说没有任何改动,只是正常SHUTDOWN,然后关机,换板子,重启服务器,再重启数据库,数据库启动后,LISTENER一启动,数据库就报错宕了。我查了一下,发现有两个BUG和这个情况类似:Bug 3691672和 Bug 2882661 ,于是建议客户先打PATCH 3691672,并且建议客户查查JVM有没有什么问题。由于这是一个限制性PATCH,于是找人下补丁。对于BUG 2882661,可以通过下面的方法修复JVM:begin initjvmaux.rollbacksetup;commit; initjvmaux.rollbackset;delete from java$rmjvm$aux;commit; initjvmaux.rollbackset;insert into java$rmjvm$aux (select joxftobn from x$joxfc where bitand(joxftflags,96)=0);commit; initjvmaux.rollbackset;delete from java$rmjvm$aux where obj# in (select obj# from obj$ where type#=29 and mtime>(select date_loaded from registry$ where cid='CATPROC'));commit; initjvmaux.rollbackset;delete from dependency$ where d_obj# in (select obj# from java$rmjvm$aux); commit;initjvmaux.rollbackset; update obj$ set status=5 wheretype#=29 and obj# in (select obj# from java$rmjvm$aux); commit;initjvmaux.rollbackset; delete from java$rmjvm$aux;commit;
前几天流沙的ORA-1591问题,由于是QQ对话,可能对于对ORA-1591缺乏经验的人来说不容易看懂,本帖针对这个问题进行进一步的介绍。这样参考那个实例就可以更加清晰的了解ORA-1591问题了。 ORA-01591: "lock held by in-doubt distributed transaction %s"
Cause: Trying to access resource that is locked by a dead
two-phase commit transaction that is in prepared state.
Action: DBA should query the pending_trans$ and related tables,
and attempt to repair network connection(s) to
coordinator and commit point. If timely repair is not
possible, DBA should contact DBA at commit point if
known or end user for correct outcome, or use heuristic
default if given to issue a heuristic commit or abort
command to finalize the local portion of the
distributed transaction. 总的来说ORA-1591的产生原因是分布式事务失败,失败的原因很多,比如网络问题、XA资源管理器存在BUG等,都可能引起失败。一旦分布式事务失败,本地事务中,如果有一个事务挣处于活跃状态,那么该事务相关的数据就会被锁定(无论读写都会被锁定),如果访问这个事务关联的数据,就会报ORA-1591。一般情况下,ORA-1591可以自动的解开,SMON会在一定时间周期内检查DBA_2PC_PENDING,找出需要回退的事务,并进行自动的恢复。这里就有几个问题,由于分布式事务超时判断以及RECO处理周期的关系,一般来说事务自动恢复的时间为1分钟以上,较长的可以达到5-10分钟。可能会对生产系统造成比较大的影响。为了加快解锁,可以使用手工处理。这个时候可以使用ROLLBACK FORCE或者COMMIT FORCE。有时候由于分布式事务恢复出现故障,会出现数据字典不一致,此时该分布式事务就无法正常解除,需要手工干预来处理。分析方法:1、检查分布式事务的状态:SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID, STATE, MIXED, HOST, COMMIT#FROM DBA_2PC_PENDINGWHERE LOCAL_TRAN_ID = '报错的本地事务号'2、检查分布式事务相关其他节点的情况:
一、背景资料操作系统:redhat linux as4 update4 Linux yhshora1 2.6.9-42.ELlargesmp #1 SMP Wed Jul 12 23:46:39 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux数据库版本:10.2.0.3 CRS版本:10.2.0.3BUNDLE PATCH:6160398 节点数量:2故障描述: 系统上线后,每隔一段时间(周期不定,从几天到10多天),其中一个节点会莫名其妙的重启。甚至在系统负载很轻的时候(临晨3点多)也会莫名其妙的重启。客户已经在METALINK上开了SR,Oracle的Philippe 负责跟进这个SR,要求采集了RDA数据,由于没有操作系统的情况,建议客户安装了OSWATCH。由于缺乏资料,Philippe 无法进行进一步的分析,SR处于PENDING状态。3月初,在业务高峰,CRS再次重启,此次OSW采集了部分操作系统信息,而由于缺失部分Oracle后台进程的日志,Oracle方面的分析仍然无法顺利进行。Philippe 建议:We just can delay the reboot, but the clusterware need to reboot the blocked system to avoid full database hangs.The remaining instance and system is blocked during that time (during the reconfiguration or because the failingnode is holding some locks/buffer needed by the other node). So, it is always possible todelay the reboot a bit, but at the expense of a longer hanging situation for the remaining node.note:284752.1 explains how to change the css timeout value. It is set today to 60 (seconds), but can be set to 300 totry to avoid that an eviction get declared rapidly. It then give 5min to the blocked system to recope from the activityblocking everything (instead of 1min). 并建议安装如下补丁:Please install the patch:5679560 (see note:5679560.8). It will update the init* files with scripts that are morerobust agains sporadic cpu shortages, i.e. they don't trigger a potentially unnecessary reboot dueto bug:5679560. They can also log something more in case of problems in the /var/log/messagesPlease keep collecting the oswatcher data afterwards for a while (in case of).由于客户的系
从一个已经过时快10年的参数开始说起,确实远了点。不过用过Oracle7和Oracle80的朋友肯定对这个参数有印象。如果一个大事务死了,那么是一件很痛苦的事情,因为要等PMON来进行回滚,而PMON为了保证不被大事务搞死,每次只处理该参数定义的那么多UNDO记录。因此大事务如果突然被中断,特别是处理事务的SESSION突然故障退出,那么事务的回滚十分缓慢。因此从Oracle 8i开始引入了一个新的机制,就是fast start parallel rollback机制,SMON可以使用并行恢复的机制来进行事务的恢复。通过并行恢复,来弥补以前大事务回滚机制的不足。fast_start_parallel_rollback可以设置为false,low,high。 不过并行恢复机制存在一些问题,有时候会导致恢复占用过多的CPU资源(如果设置为HIGH),也有可能导致并行恢复十分缓慢。如果你从:select * from v$fast_start_servers; 看到多个SERVER都在RECOVER,那么说明工作正常,如果只有一个是RECOVER,其他都是IDLE,那么这种情况下,并行恢复可能比旧的恢复机制还要慢。这个时候如果你想取消并行恢复,那么如果直接修改fast_start_parallel_rollback=false可能会HANG住,因为SMON正在做RECOVER。这个时候,可以按照Note:238507.1 How to Disable Parallel Transaction Recovery When Parallel Txn Recovery is Active中的方法: 1. Find SMON's Oracle PID:
Example:
SQL> select pid, program from v$process where program like '%SMON%';
PID PROGRAM
---------- ------------------------------------------------
6 oracle@stsun7 (SMON)
2. Disable SMON transaction cleanup:
SVRMGR> oradebug setorapid
SVRMGR> oradebug event 10513 trace name context forever, level 2
3. Kill the PQ slaves that are doing parallel transaction recovery.
You can check V$FAST_START_SERVERS to find these.
4. Turn off fast_start_parallel_rollback:
alter system set fast_start_parallel