同事反馈国外点在国内的xxx备库实例宕,尝试将该实例重启,结果重启报如下错误,未能正常启动该数据库。
standby crash recovery failed to bring standby database to a consistent
point because needed redo hasn't arrived yet.
mrp: wait timeout: thread 1 sequence# 0
errors with log dg/xxxx/archivelog/2021_12_29/thread_2_seq_24193.3766.1092589235
standby crash recovery aborted due to error 16016.
errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_ora_67164.trc:
ora-16016: archived log for thread 1 sequence# 22789 unavailable
recovery interrupted!
some recovered datafiles maybe left media fuzzy
media recovery may continue but open resetlogs may fail
completed standby crash recovery.
errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_ora_67164.trc:
ora-10458: standby database requires recovery
ora-01196: file 1 is inconsistent due to a failed media recovery session
ora-01110: data file 1: ' dg/xxxx/datafile/system.419.941189591'
ora-10458 signalled during: alter database open...
帮同事看了下问题,经过多次尝试,最终将数据库重新open,并正常接收主库传来的日志,数据库同步正常。
问题看似解决了,其实只是做了一半,就是为什么这个实例会宕,另外就是这个数据库之前也配置了告警,为什么本次没有发出告警,要分析下原因,避免下次再重新改情况。
该数据库是国外点在国内的一个备库,和主库实时同步,数据库版本是oracle 11.2.0.4 ,该数据库服务器上共部署了两个实例,而本次是只宕了其中一个实例,另一个实例正常。
首先查看了下问题发生时段的alert日志,在日志里看到了在数据库发生故障时所报的日志信息:
errors in file /u01/app/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_lgwr_45171.trc:
ora-04021: timeout occurred while waiting to lock object
lgwr (ospid: 45171): terminating the instance due to error 4021
wed dec 29 17:20:02 2021
system state dump requested by (instance=1, osid=45171 (lgwr)), summary=[abnormal instance termination].
system state dumped to trace file /u01/app/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_diag_45155_20211229172002.trc
dumping diagnostic data in directory=[cdmp_20211229172002], requested by (instance=1, osid=45171 (lgwr)), summary=[abnormal instance termination].
wed dec 29 17:20:02 2021
ora-1092 : opitsk aborting process
wed dec 29 17:20:02 2021
license high water mark = 49
instance terminated by lgwr, pid = 45171
user (ospid: 159932): terminating the instance
instance terminated by user, pid = 159932
wed dec 29 17:20:04 2021
adjusting the default value of parameter parallel_max_servers
from 1280 to 970 due to the value of parameter processes (1000)
starting oracle instance (normal)
************************ large pages information *******************
per process system memlock (soft) limit = 143 gb
total shared global region in large pages = 14 gb (75%)
large pages used by this instance: 7329 (14 gb)
large pages unused system wide = 6 (12 mb)
large pages configured system wide = 17000 (33 gb)
large page size = 2048 kb
从日志里可以看到由于产生了ora-1092 : opitsk aborting process报错,导致该数据库实例宕。此时就要去分析是什么原因导致产生ora-1092报错。首先看了下报错时产生的/u01/app/oracle/diag/rdbms/xxxx/xxxx/trace/xxxx_diag_45155_20211229172002.trc文件,文件内容太多,格式化输出后也不太好排查,希望能从其它地方获取到一些价值信息。
首先在mos上查询了 ora-1092 : opitsk aborting process报错原因,mos上讲该问题产生的原因也比较多,此时尝试看下操作系统日志,看下是否有什么价值的信息。
从操作系统的 /var/log/messages里,发现日志不停的有xxx-xxx-standby-db1 kernel: ext4-fs warning (device sda2): ext4_dx_add_entry: directory index full!这样的报错:
该报错的原因是因为 操作系统小文件太多,导致innode块被占用完了,然后查了下数据库相关文件,发现是两个数据库的审计文件占用了大量innode,通过du -sh 查看都被夯住了,是否是该问题导致数据库实例宕呢。
经查询该数据库已关闭了审计查询,但sys用户依然会产生大量审计文件。
于是通过find 结合 xargs 和rm 将两个实例下的审计文件大量删除,并配置了定时清理任务,后续观察了几天,innode已经由原来的89%降到了当前的34%,操作系统日志未在报index full满的报错信息,备库和主库的同步也正常。