[Note] Unless otherwise stated, all operations below are performed on the standby (target) side.
This article covers the troubleshooting process, apply-speed optimization on the DG standby, and an approach to DG monitoring.
Summary: an 11g DG (Data Guard) single-instance standby fell 6 days behind, with nearly 3 TB of archived logs unapplied. This post walks through the incident handling and how to catch up quickly once data has piled up.
Fault cause analysis:
1) The first check was the database open mode: it showed "READ ONLY" instead of the normal "READ ONLY WITH APPLY":
SQL> select open_mode from v$database;
OPEN_MODE
--------------------
READ ONLY
2) Why did log apply stop? The alert log showed that the disk had filled up, which shut down the background apply process MRP0:
ARC3: Closing local archive destination LOG_ARCHIVE_DEST_1: '/mnt/vdd1/archivelog/arc0000202222_1013199719.0001.arc' (error 19502) (orcl)
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance orcl - Archival Error
ORA-16038: log 4 sequence# 202222 cannot be archived
ORA-19502: write error on file "", block number (block size=)
ORA-00312: online log 4 thread 1: '/mnt/oracle/oradata/orcl/redo04.log'
Sat Jul 02 14:14:06 2022
Media Recovery Log /mnt/vdd1/archivelog/arc0000201136_1013199719.0001.arc
Sat Jul 02 14:14:08 2022
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance orcl - Archival Error
ORA-16014: log 4 sequence# 202222 not archived, no available destinations
ORA-00312: online log 4 thread 1: '/mnt/oracle/oradata/orcl/redo04.log'
Errors with log /mnt/vdd1/archivelog/arc0000201136_1013199719.0001.arc
MRP0: Background Media Recovery terminated with error 1237
Errors in file /mnt/oracle/diag/rdbms/orcl_st/orcl/trace/orcl_pr00_29562.trc:
ORA-01237: cannot extend datafile 91
ORA-01110: data file 91: '/mnt/vdd1/oradata/ts_part4_01'
ORA-19502: write error on file "/mnt/vdd1/oradata/ts_part4_01", block number 4130560 (block size=8192)
ORA-27072: File I/O error
Linux-x86_64 Error: 28: No space left on device
Additional information: 4
Additional information: 4130560
Additional information: -1
Managed Standby Recovery not using Real Time Apply
Sat Jul 02 14:15:29 2022
Recovery interrupted!
Recovered data files to a consistent state at change 17592401117937
Sat Jul 02 14:15:31 2022
MRP0: Background Media Recovery process shutdown (orcl)
3) The disk filled up around 14:00, while the scheduled task that deletes archived logs on the standby runs at 18:00. After 18:00, DG detected free space again and resumed shipping archived logs to the standby, but since the apply process MRP0 was already dead, data synchronization stayed broken (a sketch of this kind of cleanup task follows below).
4) The real pitfall: nobody noticed for days. Monitoring and alerting on DG status must be set up (covered in the monitoring section at the end).
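For context, a minimal sketch of the kind of scheduled cleanup task described above; the schedule, retention window and paths are assumptions for illustration, not the author's actual job:
#!/bin/bash
# arch_cleanup.sh -- hypothetical standby archivelog cleanup,
# run from cron at 18:00:  0 18 * * * /mnt/script/arch_cleanup.sh
. $HOME/.bash_profile
# remove archived logs older than 3 days (retention window is an assumption)
rman target / nocatalog <<EOF
crosscheck archivelog all;
delete noprompt archivelog all completed before 'sysdate-3';
exit;
EOF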
Fault resolution:
1) Since the archived logs had already been removed by the scheduled deletion task, they first had to be restored. The standby had applied up to sequence 201136, while the primary still held archives up to 207163. Limited by disk space, the restore was split into batches, starting with the archived logs between sequences 202315 and 203000, run in the background.
Create a shell script (the RMAN heredoc body was truncated in the original post; the restore command below is reconstructed from the sequence range stated above):
cat arch_restore.sh
#!/bin/bash
. $HOME/.bash_profile
rman target / nocatalog msglog /mnt/script/logs/rman-arch.log <<EOF
restore archivelog from sequence 202315 until sequence 203000;
exit;
EOF
Run the script in the background (progress can be followed in the msglog /mnt/script/logs/rman-arch.log):
nohup sh arch_restore.sh > alog07070813.log &
2) Restart MRP0 to resume log apply (### then I went off to do other things; checking back a few hours later, only three hundred-odd files had been applied... ###)
alter database recover managed standby database cancel;
alter database recover managed standby database using current logfile disconnect;
3) Checking the log again: apply was stuck waiting for a missing archive. I fetched one from the primary, and then it simply died anyway (### crying in the bathroom o(╥﹏╥)o... ###)
The log showed:
Wed Jul 06 18:59:07 2022
Archived Log entry 206097 added for thread 1 sequence 206015 ID 0xe08f092f dest 1:
Wed Jul 06 18:59:41 2022
Media Recovery Log /mnt/vdd1/archivelog/arc0000201380_1013199719.0001.arc
Error opening /mnt/vdd1/archivelog/arc0000201380_1013199719.0001.arc
Attempting refetch
Media Recovery Waiting for thread 1 sequence 201380
Fetching gap sequence in thread 1, gap sequence 201380-201380
Wed Jul 06 18:59:42 2022
RFS[7513]: Allowing overwrite of partial archivelog for thread 1 sequence 201380
RFS[7513]: Opened log for thread 1 sequence 201380 dbid 1116521861 branch 1013199719
Archived Log entry 206098 added for thread 1 sequence 201380 rlc 1013199719 ID 0x0 dest 2:
Media Recovery Log /mnt/vdd1/archivelog/arc0000201380_1013199719.0001.arc
Wed Jul 06 19:00:09 2022
Archived Log entry 206099 added for thread 1 sequence 206016 ID 0xe08f092f dest 1:
Wed Jul 06 19:01:35 2022
Media Recovery Log /mnt/vdd1/archivelog/arc0000201381_1013199719.0001.arc
Error opening /mnt/vdd1/archivelog/arc0000201381_1013199719.0001.arc
Attempting refetch
Media Recovery Waiting for thread 1 sequence 201381
Fetching gap sequence in thread 1, gap sequence 201381-201381
Media Recovery Waiting for thread 1 sequence 201381
Fetching gap sequence in thread 1, gap sequence 201381-201381
Wed Jul 06 19:05:07 2022
FAL[client]: Failed to request gap sequence
 GAP - thread 1 sequence 201381-201381
 DBID 1116521861 branch 1013199719
FAL[client]: All defined FAL servers have been attempted.
Check that the CONTROL_FILE_RECORD_KEEP_TIME initialization
parameter is defined to a value that's sufficiently large
enough to maintain adequate log switch information to resolve
archivelog gaps.
4) Batch-copy the logs to the standby and resolve the gap
-- [standby] gap query
select count(*) from v$archive_gap;
-- [standby] this view can at times be extremely slow to query; the following statement can be used instead:
select userenv('instance'), high.thread#, low.lsq, high.hsq
from
(select a.thread#, rcvsq, min(a.sequence#)-1 hsq
from v$archived_log a,
(select lh.thread#, lh.resetlogs_change#, max(lh.sequence#) rcvsq
from v$log_history lh, v$database_incarnation di
where lh.resetlogs_time = di.resetlogs_time
and lh.resetlogs_change# = di.resetlogs_change#
and di.status = 'CURRENT'
and lh.thread# is not null
and lh.resetlogs_change# is not null
and lh.resetlogs_time is not null
group by lh.thread#, lh.resetlogs_change#
) b
where a.thread# = b.thread#
and a.resetlogs_change# = b.resetlogs_change#
and a.sequence# > rcvsq
group by a.thread#, rcvsq) high,
(select srl_lsq.thread#, nvl(lh_lsq.lsq, srl_lsq.lsq) lsq
from
(select thread#, min(sequence#)+1 lsq
from
v$log_history lh, x$kccfe fe, v$database_incarnation di
where to_number(fe.fecps) <= lh.next_change#
and to_number(fe.fecps) >= lh.first_change#
and fe.fedup!=0 and bitand(fe.festa, 12) = 12
and di.resetlogs_time = lh.resetlogs_time
and lh.resetlogs_change# = di.resetlogs_change#
and di.status = 'CURRENT'
group by thread#) lh_lsq,
(select thread#, max(sequence#)+1 lsq
from
v$log_history
where (select min( to_number(fe.fecps))
from x$kccfe fe
where fe.fedup!=0 and bitand(fe.festa, 12) = 12) >= next_change#
group by thread#) srl_lsq
where srl_lsq.thread# = lh_lsq.thread#(+)
) low
where low.thread# = high.thread#
and lsq <= hsq
and hsq > rcvsq;
-- [primary] copy the archived logs to the standby (scp takes the port with -P; source files go before the destination)
scp -P <port> <archivelog files> username@ip:<standby archive directory>
-- [standby] manually register the archived log (registering the first missing one is enough; registering later ones just reports they are already registered)
alter database register logfile '<archivelog_dir_absolute_path>/arc0000201381_1013199719.0001.arc';
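When many sequences are missing, a loop on the primary saves the one-by-one copying. A minimal sketch, assuming the file naming pattern visible in the logs above (arc<10-digit sequence>_1013199719.0001.arc); host, port and directories are placeholders:
#!/bin/bash
# batch_ship_arch.sh -- hypothetical: run on the primary to ship a gap range
. $HOME/.bash_profile
ARCH_DIR=/path/to/primary/archivelog            # assumption: primary archive destination
DEST=username@standby_ip:/mnt/vdd1/archivelog   # standby archive directory from the logs
LOW=201380; HIGH=201381                         # gap range reported by the alert log
for seq in $(seq "$LOW" "$HIGH"); do
    # name pattern taken from the alert log: arc0000201380_1013199719.0001.arc
    f=$(printf 'arc%010d_1013199719.0001.arc' "$seq")
    scp -P 22 "$ARCH_DIR/$f" "$DEST" || echo "copy failed or file missing: $f"
done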
(### You thought that was the end of it??? ###)
My online redo logs are 500 MB and most archived logs land between 400 and 500 MB. Watching the apply, sometimes it chewed through one file a minute, sometimes one every 3-5 minutes, sometimes two or three a minute. Left like that, the business side would have KO'd me. So how to speed it up??
DG standby apply optimization and acceleration
1) Enable parallelism (according to load and CPU core count; set at most cpu*2)
-- cancel log apply
alter database recover managed standby database cancel;
-- the default was 8 channels; raise the apply parallelism (16 here; degrees from 4 to 20 were tried, see below)
alter database recover managed standby database parallel 16 using current logfile disconnect;
2) For log shipping from the primary to the standby, given the large backlog, ship for a while and then pause for a while (run on the primary); a toggle sketch follows this item
-- stop shipping archived logs from the primary to the standby
alter system set log_archive_dest_state_2='defer';
-- resume archived log shipping
alter system set log_archive_dest_state_2='enable';
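A minimal sketch of automating that ship-then-pause cycle from the primary; the intervals are assumptions chosen for illustration:
#!/bin/bash
# transport_toggle.sh -- hypothetical: alternate shipping on the primary so the
# standby can digest its backlog; intervals are illustrative, tune to the workload
. $HOME/.bash_profile
while true; do
    sqlplus -s / as sysdba <<EOF
alter system set log_archive_dest_state_2='defer';
exit;
EOF
    sleep 1800   # pause shipping for 30 min while the standby applies its backlog
    sqlplus -s / as sysdba <<EOF
alter system set log_archive_dest_state_2='enable';
exit;
EOF
    sleep 600    # ship fresh archives for 10 min
done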
3) Tune parameters and restart the database (run on the standby)
-- check the current settings
show parameter parallel_execution_message_size  -- value was 16384
show parameter filesystemio_options             -- default none
-- at the linux os level, async i/o slab usage can be checked with
grep kio /proc/slabinfo
-- enable asynchronous i/o (single instance, 11.2.0.4)
alter system set filesystemio_options=setall scope=spfile;
-- message size used in parallel execution; the valid range in 11g is 2148 to 32768
alter system set parallel_execution_message_size=32768 scope=spfile;
After applying points 1 and 2 above, speed improved somewhat but stayed erratic (thanks to lgs for the ideas).
With log shipping stopped, I tried parallel degrees of 4, 6, 8, 12, 16 and 20, yet resource monitoring showed cpu and i/o practically idle, below 20% utilization, which left no case for throwing more hardware at it.
That left the parameter tuning of point 3. During the midday business lull I changed the parameters, exported a pfile, compared it against the previous init file, and once everything checked out, restarted the standby database [with great caution]; the restart sequence is sketched below.
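A minimal sketch of that restart sequence on the standby; file paths and the parallel degree are placeholders rather than the author's exact values:
#!/bin/bash
# restart_standby.sh -- hypothetical restart flow for the standby
. $HOME/.bash_profile

# 1. stop apply and export a pfile for comparison with the previous init file
sqlplus -s / as sysdba <<EOF
alter database recover managed standby database cancel;
create pfile='/tmp/init_orcl_check.ora' from spfile;
exit;
EOF

# 2. diff /tmp/init_orcl_check.ora against the saved copy before going further!

# 3. bounce the standby and resume managed recovery
sqlplus -s / as sysdba <<EOF
shutdown immediate
startup mount
alter database open read only;
alter database recover managed standby database parallel 16 using current logfile disconnect;
exit;
EOF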
With all three changes in place, the effect was dramatic and the result most welcome:
a 10-20x speedup, from 1-3 files per 2 minutes to 30-50 files per 2 minutes (one way to measure this rate is sketched below).
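A minimal sketch of measuring that rate on the standby: sample the highest applied sequence twice and take the difference; the 2-minute interval matches the figures above, the query itself is my assumption:
#!/bin/bash
# apply_rate.sh -- hypothetical: rough archived-logs-applied-per-2-minutes
. $HOME/.bash_profile

get_seq() {
    sqlplus -s / as sysdba <<EOF
set heading off feedback off
select max(sequence#) from v\$archived_log where applied='YES';
exit;
EOF
}

before=$(get_seq | tr -d '[:space:]')
sleep 120
after=$(get_seq | tr -d '[:space:]')
echo "archived logs applied in the last 2 minutes: $((after - before))"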
Enable DG monitoring, implemented with shell scripts plus scheduled tasks (a combined sketch follows this list):
1) Alert-log monitoring: alert on keywords in the alert log (refine the keyword list as you go)
2) Monitor through the v$dataguard_stats view (unit: minutes):
select to_number(substr(s.value,2,2))*24*60 + to_number(substr(s.value,5,2))*60 + to_number(substr(s.value,8,2)) as nums
from v$dataguard_stats s where name='apply lag';
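A minimal sketch combining both checks for cron; the alert-log path is inferred from the trace path earlier, while the mail command, thresholds and keyword list are assumptions:
#!/bin/bash
# dg_monitor.sh -- hypothetical DG monitoring, e.g. from cron: */10 * * * *
. $HOME/.bash_profile
ALERT_LOG=/mnt/oracle/diag/rdbms/orcl_st/orcl/trace/alert_orcl.log
MAILTO=dba@example.com

# 1) keyword scan over the recent tail of the alert log
if tail -n 200 "$ALERT_LOG" | grep -qiE 'ORA-|media recovery terminated|archival error'; then
    echo "DG alert-log errors on $(hostname)" | mail -s "dg alert log" "$MAILTO"
fi

# 2) apply lag in minutes from v$dataguard_stats
lag=$(sqlplus -s / as sysdba <<EOF
set heading off feedback off
select to_number(substr(s.value,2,2))*24*60
     + to_number(substr(s.value,5,2))*60
     + to_number(substr(s.value,8,2))
from v\$dataguard_stats s where name='apply lag';
exit;
EOF
)
lag=$(echo "$lag" | tr -d '[:space:]')
if [ -n "$lag" ] && [ "$lag" -gt 30 ]; then   # threshold: 30 minutes (assumption)
    echo "DG apply lag is ${lag} minutes" | mail -s "dg apply lag" "$MAILTO"
fi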
Other related queries:
-- check archived log restore/apply status: when sync is healthy, the row following the last YES shows IN-MEMORY, i.e. the log file currently being applied; you can also locate the latest applied archive by sequence number (applied=YES means applied, NO means not yet applied)
select g.recid, g.sequence#, g.applied, status, archived, g.first_time
from v$archived_log g
where 1=1 -- and g.sequence# > 201134
and applied='IN-MEMORY'
order by g.sequence#;
-- check standby DG process status
select process, client_process, status, thread#, sequence#, block#, blocks from v$managed_standby;
-- check online (standby redo) log status
select thread#, group#, sequence#, bytes, archived, s.status , blocksize, member
from v$standby_log s
join v$logfile using (group#)
order by thread#, group#;
-- check process states: DG apply performance monitoring
select to_char(sysdate,'dd-mon-yyyy hh24:mi:ss') "current time"
,s.process
, p.spid
, substr(s.program, -6) proc
, s.event
, s.p1
, s.p2
, s.p3
, s.seconds_in_wait siw
, s.seq#
from v$session s, v$process p
where p.addr = s.paddr and (s.program like '%mrp%' or s.program like '%pr0%' or s.program like '%dbw%' or s.program like '%ckpt%')
order by s.process;