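One recent afternoon my phone kept receiving Zabbix alert SMS messages reporting errors in the ASM log of an Oracle 11g production server in the same-city disaster recovery data center. The alert read: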
[xxx-hg-xxx-db-oracle]oracle ora asm log error(s) found on phx-h-db-oracle-10.xx.xx.xxx-db1: problem (value: >ora-27090: unable to reserve kernel resources for asynchronous disk i/o
ora-27090: unable to reserve kernel resources for asynchronous disk i/o
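The server is a node of an 11g RAC cluster. The cluster's two original servers, which ran CentOS 6, had gone out of hardware warranty, and the newly purchased replacement servers were deployed with CentOS 7.9. No error of this kind had appeared before the replacement.
After logging in to the database environment and checking the logs, only the ASM log on node 1 showed the error: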
errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_122303.trc:
ora-27090: unable to reserve kernel resources for asynchronous disk i/o
linux-x86_64 error: 2: no such file or directory
additional information: 3
additional information: 128
additional information: 1
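From 16:00 that day the error recurred frequently, at times triggering an alert SMS every few minutes.
Searching online and on My Oracle Support (MOS) showed that ORA-27090: unable to reserve kernel resources for asynchronous disk I/O is related to the operating system's AIO settings.
According to the MOS note ORA-27090 - Unable to Reserve Kernel Resources for Asynchronous Disk I/O (Doc ID 579108.1), the cause is an aio-max-nr value that is too low: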
cause
the “aio-max-nr” kernel limit is too low.
solution
the “aio-max-nr” kernel limit should be adjusted according to oracle recommendations which are available in this document:
The current aio-nr and aio-max-nr values on the server were:
[root@xxxxx ~]# cat /proc/sys/fs/aio-max-nr
1048576
[root@xxxx ~]# cat /proc/sys/fs/aio-nr
1047552
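aio-nr is the running total of the number of events specified in io_setup system calls across all currently active AIO contexts. If aio-nr reaches aio-max-nr, io_setup fails with EAGAIN (errno 11 on Linux, "Resource temporarily unavailable"). Here aio-nr was only 1,024 events below the limit.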
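The relationship between the two counters can be checked with a few lines of shell. This is a minimal sketch, not part of the original procedure: the function name `aio_headroom` is made up for illustration, and its two optional arguments default to the /proc files shown above.

```shell
# aio_headroom: print how many AIO events can still be reserved system-wide.
# Hypothetical helper for illustration. Both arguments are optional and
# default to the kernel's /proc files.
aio_headroom() {
    nr_file=${1:-/proc/sys/fs/aio-nr}
    max_file=${2:-/proc/sys/fs/aio-max-nr}
    nr=$(cat "$nr_file") || return 1
    max=$(cat "$max_file") || return 1
    # Events left before io_setup() starts failing with EAGAIN.
    echo $((max - nr))
}
```

Calling `aio_headroom` with no arguments reads the live counters. With the values captured above (1048576 and 1047552) the result is 1024.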
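The trace file named in the error log suggested, as a first assessment, that the error was related to the scheduled monitoring query against v$asm_diskgroup: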
[root@xxxx ~]# more /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_171522.trc
trace file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_171522.trc
oracle database 11g enterprise edition release 11.2.0.4.0 - 64bit production
with the real application clusters and automatic storage management options
oracle_home = /u01/app/11.2.0/grid
system name: linux
node name: xxx-xxx-xx-db1
release: 3.10.0-1160.el7.x86_64
version: #1 smp mon oct 19 16:18:59 utc 2020
machine: x86_64
instance name: asm1
redo thread mounted by this instance: 0
oracle process number: 33
unix process pid: 171522, image: oracle@xxx-xxx-xx-db1 (tns v1-v3)
*** 2023-04-23 09:07:33.222
*** session id:(64.55981) 2023-04-23 09:07:33.222
*** client id:() 2023-04-23 09:07:33.222
*** service name:() 2023-04-23 09:07:33.222
*** module name:(sqlplus@xxx-xxx-xx-db1 (tns v1-v3)) 2023-04-23 09:07:33.222
*** action name:() 2023-04-23 09:07:33.222
warning:could not increase the asynch i/o limit to 256 for kfdparallelio.
*** 2023-04-23 09:07:33.222
dbkeddefdump(): starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- error stack dump -----
ora-27090: unable to reserve kernel resources for asynchronous disk i/o
linux-x86_64 error: 11: resource temporarily unavailable
additional information: 3
additional information: 128
additional information: 386140135
----- current sql statement for this session (sql_id=7v21cmm3d7z26) -----
select name,state from v$asm_diskgroup
----- call stack trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
skdstdst()+41        call     kgdsdst()            000000000 ? 000000000 ?
                                                   7ffeb42e36d0 ? 7ffeb42e37a8 ?
                                                   7ffeb42e8250 ? 000000002 ?
ksedst1()+103        call     skdstdst()           000000000 ? 000000000 ?
                                                   7ffeb42e36d0 ? 7ffeb42e37a8 ?
                                                   7ffeb42e8250 ? 000000002 ?
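When many trace files pile up, reading each one by hand is tedious. The sketch below shows one way to summarize which SQL statements were running in the traces that mention the error. It assumes each trace records the statement on the line right after the "current sql statement for this session" marker, as in the trace above. The function names `trace_sql` and `sql_behind_error` are hypothetical.

```shell
# trace_sql: print the SQL recorded in one trace file, i.e. the line that
# follows the "current sql statement for this session" marker.
trace_sql() {
    awk '
        found { print; exit }
        tolower($0) ~ /current sql statement for this session/ { found = 1 }
    ' "$1"
}

# sql_behind_error: for every *.trc file in a directory that mentions the
# given error code (default ORA-27090, matched case-insensitively), print
# the SQL it was running, with a count per distinct statement.
sql_behind_error() {
    dir=$1
    code=${2:-ORA-27090}
    grep -il -- "$code" "$dir"/*.trc 2>/dev/null |
    while IFS= read -r f; do
        trace_sql "$f"
    done | sort | uniq -c | sort -rn
}
```

A typical call would be `sql_behind_error "$TRACE_DIR"`, where `TRACE_DIR` is a placeholder for the ASM instance's trace directory. If most matching traces show the same monitoring query, that supports the assessment above.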
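Based on this, aio-max-nr was raised to a larger value. Before the change it was set to the previously recommended value, aio-max-nr=1048576.
Before the change: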
[root@xxxx~]# cat /proc/sys/fs/aio-max-nr
1048576
The value can be changed with the following command, without restarting the database or the operating system:
sysctl -w fs.aio-max-nr=50000000
To make sure the change would not adversely affect the system, it was tested in a test environment beforehand.
After the change:
[root@xxxx~]# cat /proc/sys/fs/aio-max-nr
50000000
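One caveat: `sysctl -w` only changes the running kernel, so the value reverts at the next reboot unless it is also written to a sysctl configuration file. The source does not say how the setting was persisted. The following is a sketch of one possible approach, with a hypothetical helper name `persist_sysctl`, that replaces any existing line for a key in a given file and appends the new value.

```shell
# persist_sysctl: write "key = value" to a sysctl configuration file,
# replacing any existing (uncommented) line for the same key.
# Hypothetical helper for illustration.
persist_sysctl() {
    key=$1
    value=$2
    conf=$3
    touch "$conf" || return 1
    tmp=$(mktemp) || return 1
    # Copy every line except earlier settings of this key.
    awk -v k="$key" -F= '
        { name = $1; gsub(/[ \t]/, "", name) }
        name != k { print }
    ' "$conf" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$conf" && rm -f "$tmp"
}
```

Assuming the earlier value of 1048576 was set in /etc/sysctl.conf (the source does not confirm this), running `persist_sysctl fs.aio-max-nr 50000000 /etc/sysctl.conf` as root and then `sysctl -p` would make the new limit survive a reboot.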
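Because this is a RAC environment, aio-max-nr was also changed on node 2 to keep the kernel parameters identical across the nodes, even though node 2 had not reported the error.
After the change the system was monitored for several days, and the error did not recur.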