An incident caused by applying a small Oracle patch. The environment is a two-node 19c RAC on Linux 7; a few hours after the patch was applied, one of the nodes rebooted.
- The DB alert log shows no errors; the instance was started again at 05:17:22
2022-01-11t04:55:07.629428 08:00
tt02 (pid:67750): srl selected for t-2.s-26402 for lad:2
2022-01-11t04:55:16.834464 08:00
arc2 (pid:67638): archived log entry 100056 added for t-2.s-26401 id 0xecb9a320 lad:1
# No error messages before the restart
2022-01-11t05:17:22.625738 08:00
starting oracle instance (normal) (os id: 15301)
2022-01-11t05:17:22.777374 08:00
************************************************************
instance sga_target = 307200 mb and sga_max_size = 307200 mb
************************************************************
- CRS alert log (failed node 2)
2022-01-11 00:32:49.113 [ocssd(72170)]crs-7503: the oracle grid infrastructure process 'ocssd' observed communication issues between node 'rac2' and node 'rac1', interface list of local node 'rac2' is '192.168.0.2:54393;', interface list of remote node 'rac1' is '192.168.0.1:18424;'.
2022-01-11 01:06:58.153 [evmd(68443)]crs-7503: the oracle grid infrastructure process 'evmd' observed communication issues between node 'rac2' and node 'rac1', interface list of local node 'rac2' is '192.168.0.2:40437;', interface list of remote node 'rac1' is '192.168.0.1:43662;'.
# No error messages before the restart
2022-01-11 05:15:52.838 [ohasd(61516)]crs-8500: oracle clusterware ohasd process is starting with operating system process id 61516
2022-01-11 05:15:52.909 [ohasd(61516)]crs-0714: oracle clusterware release 19.0.0.0.0.
2022-01-11 05:15:52.924 [ohasd(61516)]crs-2112: the olr service started on node rac2.
2022-01-11 05:15:53.251 [ohasd(61516)]crs-1301: oracle high availability service started on node rac2.
# This shows that at 05:08:56, because I/O did not complete on a majority of the voting disks, oracssdagent was about to reboot node 2.
2022-01-11 05:15:53.254 [ohasd(61516)]crs-8011: reboot advisory message from host: rac2, component: cssagent, with time stamp: l-2022-01-11-05:08:56.490
2022-01-11 05:15:53.255 [ohasd(61516)]crs-8013: reboot advisory message text: oracssdagent is about to reboot this node due to no i/o completions with majority of voting disks.
2022-01-11 05:15:53.256 [ohasd(61516)]crs-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 1 were announced and 0 errors occurred
2022-01-11 05:15:53.809 [orarootagent(69333)]crs-8500: oracle clusterware orarootagent process is starting with operating system process id 69333
2022-01-11 05:15:53.814 [cssdagent(69347)]crs-8500: oracle clusterware cssdagent process is starting with operating system process id 69347
2022-01-11 05:15:53.815 [cssdmonitor(69354)]crs-8500: oracle clusterware cssdmonitor process is starting with operating system process id 69354
2022-01-11 05:15:53.832 [oraagent(69343)]crs-8500: oracle clusterware oraagent process is starting with operating system process id 69343
2022-01-11 05:15:54.485 [oraagent(69655)]crs-8500: oracle clusterware oraagent process is starting with operating system process id 69655
- CRS alert log (remote node 1)
2022-01-11 03:54:42.911 [cvud(117742)]crs-10051: cvu found following errors with clusterware setup : prvg-13159 : on node "rac2" the file "/etc/resolv.conf" could not be parsed because the file is empty.
prve-0421 : no entry exists in /etc/fstab for mounting /dev/shm
prvf-4664 : found inconsistent name resolution entries for scan name "rac-scan"
prvg-11368 : a scan is recommended to resolve to "3" or more ip addresses, but scan "rac-scan" resolves to only "/192.168.0.13"
2022-01-11 05:06:04.438 [crsd(110493)]crs-7503: the oracle grid infrastructure process 'crsd' observed communication issues between node 'rac1' and node 'rac2', interface list of local node 'rac1' is '192.168.0.1:15060;', interface list of remote node 'rac2' is '192.168.0.2:33424;'.
2022-01-11 05:07:54.319 [oraagent(120835)]crs-5818: aborted command 'check' for resource 'ora.listener_scan1.lsnr'. details at (:crsagf00113:) {1:16836:2} in /oracle/app/grid/diag/crs/rac1/crs/trace/crsd_oraagent_grid.trc.
2022-01-11 05:08:04.332 [evmd(67934)]crs-7503: the oracle grid infrastructure process 'evmd' observed communication issues between node 'rac1' and node 'rac2', interface list of local node 'rac1' is '192.168.0.1:43662;', interface list of remote node 'rac2' is '192.168.0.2:40437;'.
2022-01-11 05:08:14.620 [crsd(110493)]crs-7503: the oracle grid infrastructure process 'crsd' observed communication issues between node 'rac1' and node 'rac2', interface list of local node 'rac1' is '192.168.0.1:15060;', interface list of remote node 'rac2' is '192.168.0.2:33424;'.
2022-01-11 05:08:55.476 [ocssd(71998)]crs-1663: member kill issued by pid 15278-387376 for 1 members, group db asm. details at (:cssgm00044:) in /oracle/app/grid/diag/crs/rac1/crs/trace/ocssd.trc.
2022-01-11 05:08:55.579 [ocssd(71998)]crs-1607: node rac2 is being evicted in cluster incarnation 516104351; details at (:cssnm00007:) in /oracle/app/grid/diag/crs/rac1/crs/trace/ocssd.trc.
# Node 2 was rebooted because I/O did not complete on a majority of the voting disks.
2022-01-11 05:08:56.492 [ohasd(58992)]crs-8011: reboot advisory message from host: rac2, component: cssagent, with time stamp: l-2022-01-11-05:08:56.490
2022-01-11 05:08:56.494 [ohasd(58992)]crs-8013: reboot advisory message text: oracssdagent is about to reboot this node due to no i/o completions with majority of voting disks.
2022-01-11 05:09:03.844 [ocssd(71998)]crs-7503: the oracle grid infrastructure process 'ocssd' observed communication issues between node 'rac1' and node 'rac2', interface list of local node 'rac1' is '192.168.0.1:54820;', interface list of remote node 'rac2' is '192.168.0.2:58358;'.
2022-01-11 05:09:25.713 [ocssd(71998)]crs-1601: cssd reconfiguration complete. active nodes are rac1 .
2022-01-11 05:09:27.305 [crsd(110493)]crs-5504: node down event reported for node 'rac2'.
2022-01-11 05:09:28.523 [crsd(110493)]crs-2773: server 'rac2' has been removed from pool 'generic'.
2022-01-11 05:09:28.525 [crsd(110493)]crs-2773: server 'rac2' has been removed from pool 'ora.orcl'.
- Analyzing system resource usage with OSWatcher
iostat: the charts (omitted here) covered the 5 voting disks and the %util of all disks; the disks themselves do not look problematic.
vmstat: memory usage at the time was severe; swap was in use, cache usage was unusually high (reason unclear at this point), while CPU usage stayed low.
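The %util numbers behind those charts can also be pulled straight from the OSWatcher iostat archive files. A minimal sketch, assuming the standard oswiostat subdirectory of the same oswbb archive and using placeholder device names for the five voting-disk LUNs:
cd /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswiostat
for f in rac2_iostat_22.01.11.05*.dat
do
  # ts = snapshot time taken from the "zzz ***..." marker; the last iostat -x column is %util
  awk '/^zzz/ {ts=$5} $1 ~ /^(sdb|sdc|sdd|sde|sdf)$/ {print ts, $1, "util%=" $NF}' "$f"
done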
- Background: kernel memory watermarks
oracle@rac2:/home/oracle> cat /proc/sys/vm/min_free_kbytes # pages_min (minimum watermark)
4194304
Low watermark: pages_low = pages_min*5/4 = 5242880k
High watermark: pages_high = pages_min*3/2 = 6291456k
# If free memory drops below pages_min, user processes have exhausted the available memory and only the kernel can still allocate.
# If free memory is between pages_min and pages_low, memory pressure is high and little memory is left; kswapd0 reclaims memory until free memory rises above pages_high.
# If free memory is between pages_low and pages_high, there is some pressure but new allocations can still be satisfied.
# If free memory is above pages_high, plenty of memory is free and there is no pressure.
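The two derived watermarks can be recomputed directly from min_free_kbytes. A minimal sketch (this is the aggregate approximation used above; the actual per-zone watermarks live in /proc/zoneinfo):
min=$(cat /proc/sys/vm/min_free_kbytes)    # pages_min, in kB
low=$(( min * 5 / 4 ))                     # pages_low
high=$(( min * 3 / 2 ))                    # pages_high
echo "pages_min=${min}k pages_low=${low}k pages_high=${high}k"
# with min_free_kbytes=4194304 this prints pages_low=5242880k and pages_high=6291456k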
- Before node 2 rebooted, free memory was low (5190700k), below pages_low (5242880k): memory pressure was high.
zzz ***tue jan 11 05:05:32 cst 2022
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
16 1 4490752 5207092 84692 253125568 0 0 478 273 0 0 2 2 95 0 0
6 1 4494592 5189996 84708 253123104 0 3324 30951 34025 197572 111405 1 3 96 0 0
4 0 4494848 5190700 84732 253122400 0 188 24189 3557 146797 88372 1 2 98 0 0
linux oswbb v8.3.2 rac2
snap_interval 15
cpu_cores 80
vcpus 160
oswbb_archive_dest /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive
zzz ***tue jan 11 05:16:06 cst 2022
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
12 1 0 380316096 6660 1503980 0 0 159 8 372 219 1 4 95 0 0
7 0 0 378362304 7852 1584300 0 0 24612 4577 90896 98582 6 2 92 0 0
15 0 0 377275776 9536 1601152 0 0 21580 21502 102854 123845 4 2 94 0 0
- Database memory budget: SGA (300 GB) + PGA (80 GB) + process memory (about 3000 processes at an estimated 3 MB each, roughly 9 GB) < 400 GB
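That budget can be sanity-checked with quick shell arithmetic. A sketch using the estimates above (3 MB per process is an estimate, not a measurement; the host has ~754 GB of RAM per the meminfo shown later):
sga_gb=300; pga_gb=80
procs=3000; per_proc_mb=3
proc_gb=$(( procs * per_proc_mb / 1024 ))              # about 8-9 GB for the server processes
echo "estimated total: $(( sga_gb + pga_gb + proc_gb )) GB"   # roughly 389 GB, under the 400 GB budget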
The following helper script counts the processes owned by the oracle user in each OSW ps snapshot (it is used below and again in the summary):
[root@pg13 ~]# cat oswps.sh
fname=$1
# collect every "zzz ***<timestamp>" snapshot marker, with spaces replaced by underscores
data_array=(`grep "zzz" $fname |awk -F"***" '{print $2}' |sed 's/ /_/g'`)
for i in ${!data_array[@]};
do
sed_start_1=${data_array[$i]}
sed_start_2=`echo $sed_start_1 |sed 's/_/ /g'`
i=$(($i+1))
sed_end=${data_array[$i]}
sed_end=`echo $sed_end |sed 's/_/ /g'`
# count the oracle-owned lines between two consecutive snapshot markers
num=`sed -n "/$sed_start_2/,/$sed_end/p" $fname |egrep ^oracle |wc -l`
echo $sed_start_1 $num
done
[root@pg13 ~]# sh oswps.sh rac2_ps_22.01.11.0500.dat
tue_jan_11_05:03:47_cst_2022 3022
tue_jan_11_05:04:02_cst_2022 3031
tue_jan_11_05:04:17_cst_2022 0
tue_jan_11_05:16:06_cst_2022 6
tue_jan_11_05:16:21_cst_2022 0
tue_jan_11_05:16:36_cst_2022 0
tue_jan_11_05:16:51_cst_2022 0
tue_jan_11_05:17:06_cst_2022 6
tue_jan_11_05:17:21_cst_2022 2
tue_jan_11_05:17:36_cst_2022 74
- Analyzing meminfo
From meminfo, cache usage is high (~250 GB) and anonymous pages are also high (~163 GB). Cache is normally filesystem page cache (Oracle ASM does not go through the page cache), and the ps monitoring data shows no process with abnormal memory usage.
Later, two more anomalies were found in meminfo: PageTables occupies a lot of memory (pagetables: 91600048 kB, ~91 GB), and a large number of huge pages are left unused (hugepages_free: 153805, ~300 GB), which is not normal.
zzz ***tue jan 11 05:05:32 cst 2022
memtotal: 790552132 kb
memfree: 5201744 kb
memavailable: 0 kb
buffers: 84692 kb
cached: 250267324 kb # <<<<< 250g
swapcached: 17512 kb
active: 163760368 kb # <<<<< 163g
inactive: 113563000 kb
active(anon): 163400368 kb
inactive(anon): 113033788 kb
active(file): 360000 kb
inactive(file): 529212 kb
unevictable: 4251944 kb
mlocked: 4252088 kb
swaptotal: 20971516 kb
swapfree: 16480764 kb
dirty: 204 kb
writeback: 0 kb
anonpages: 31370864 kb
mapped: 207003588 kb
shmem: 249168216 kb
slab: 2858256 kb
sreclaimable: 962168 kb
sunreclaim: 1896088 kb
kernelstack: 119856 kb
pagetables: 91600048 kb # <<<<< 91g
nfs_unstable: 0 kb
bounce: 0 kb
writebacktmp: 0 kb
commitlimit: 218627868 kb
committed_as: 313497152 kb
vmalloctotal: 34359738367 kb
vmallocused: 2418400 kb
vmallocchunk: 34357114360 kb
hardwarecorrupted: 0 kb
anonhugepages: 0 kb
cmatotal: 0 kb
cmafree: 0 kb
hugepages_total: 192988
hugepages_free: 153805 # <<<<< 300g
hugepages_rsvd: 0
hugepages_surp: 0
hugepagesize: 2048 kb
directmap4k: 34501632 kb
directmap2m: 242862080 kb
directmap1g: 528482304 kb
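The inflated PageTables figure fits an SGA that is mapped with 4k pages by thousands of dedicated server processes, because every process keeps its own page-table entries for the shared segment. A rough back-of-the-envelope sketch using the Shmem and PageTables values above (an estimate only; it ignores upper-level page-table pages):
# ~249 GB of 4k-page-backed shared memory, 8 bytes of PTE per 4k page, per attached process
echo $(( 249168216 / 4 * 8 / 1024 / 1024 ))   # ~475 MB of page tables for one fully-mapped process
# so on the order of 200 processes that have touched most of the SGA already explain ~91 GB of PageTables;
# with 2 MB huge pages the same mapping would need roughly 1/512 as many entries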
Continuing with the db alert log: in the instance startup right after the patch, only 39183 huge pages (~76 GB) were allocated; the rest of the shared memory fell back to 4k pages, 58581835 of them (~223 GB).
2022-01-11t00:28:34.776117 08:00
domain name: user.slice
2022-01-11t00:28:34.776192 08:00
per process system memlock (soft) limit = unlimited
2022-01-11t00:28:34.776257 08:00
expected per process system memlock (soft) limit to lock
instance max shared global area (sga) into memory: 300g
2022-01-11t00:28:34.776383 08:00
available system pagesizes:
4k, 2048k
2022-01-11t00:28:34.776504 08:00
supported system pagesize(s):
2022-01-11t00:28:34.776567 08:00
pagesize available_pages expected_pages allocated_pages error(s)
2022-01-11t00:28:34.776633 08:00
4k configured 14 58581835 none
2022-01-11t00:28:34.777018 08:00
2048k 39387 153601 39183 none
- Why weren't huge pages used for the whole SGA here? The suspicion is that not enough free huge pages were available when the instance started.
Check the shutdown time (00:13:23) and the startup time (00:28:34) during patching:
2022-01-11t00:04:57.263597 08:00
thread 2 advanced to log sequence 26368 (lgwr switch), current scn: 17283142928376
current log# 7 seq# 26368 mem# 0: data/orcl/onlinelog/group_7.398.1080744843
2022-01-11t00:04:57.492759 08:00
tt03 (pid:103953): srl selected for t-2.s-26368 for lad:2
2022-01-11t00:05:02.189653 08:00
arc1 (pid:47485): archived log entry 99931 added for t-2.s-26367 id 0xecb9a320 lad:1
# Shutdown time
2022-01-11t00:13:23.046802 08:00
shutting down oracle instance (immediate) (os id: 43111)
2022-01-11t00:14:10.945385 08:00
stopping background process rbal
2022-01-11t00:14:14.725352 08:00
freeing rdom 0
freeing the fusion rht of pdb 0
2022-01-11t00:14:28.584765 08:00
warning: 2 processes are still attached to shmid 753673:
(size: 57344 bytes, creator pid: 18969, last attach/detach pid: 28293)
instance shutdown complete (os id: 43111)
# Startup time
2022-01-11t00:28:34.620322 08:00
starting oracle instance (normal) (os id: 54670)
2022-01-11t00:28:34.773499 08:00
************************************************************
instance sga_target = 307200 mb and sga_max_size = 307200 mb
************************************************************
- Huge page usage just before the startup (00:28:34): only 39808 pages (~77 GB) were free. Sure enough, there were not enough free huge pages available at startup.
zzz ***tue jan 11 00:28:24 cst 2022 memtotal: 790552132 kb memfree: 359520492 kb memavailable: 354082620 kb buffers: 142740 kb cached: 10248144 kb ... hugepages_total: 192988 hugepages_free: 39808 hugepages_rsvd: 421 hugepages_surp: 0 hugepagesize: 2048 kb
- So the next suspicion: after the instance was shut down for patching (00:13:23), the huge pages were never released.
# Shutdown time
2022-01-11t00:13:23.046802 08:00
shutting down oracle instance (immediate) (os id: 43111)
2022-01-11t00:14:10.945385 08:00
stopping background process rbal
2022-01-11t00:14:14.725352 08:00
freeing rdom 0
freeing the fusion rht of pdb 0
2022-01-11t00:14:28.584765 08:00
warning: 2 processes are still attached to shmid 753673:
(size: 57344 bytes, creator pid: 18969, last attach/detach pid: 28293)
instance shutdown complete (os id: 43111)
The database finished shutting down at 00:14:28, yet at 00:22:09 the huge pages still had not been released.
zzz ***tue jan 11 00:22:09 cst 2022 memtotal: 790552132 kb memfree: 361520900 kb memavailable: 355090684 kb buffers: 130284 kb cached: 8306876 kb ... hugepages_total: 192988 hugepages_free: 39808 hugepages_rsvd: 421 hugepages_surp: 0 hugepagesize: 2048 kb
Looking more closely at the log, there were still processes attached to the shared memory at shutdown:
2022-01-11t00:14:28.584765 08:00 warning: 2 processes are still attached to shmid 753673: (size: 57344 bytes, creator pid: 18969, last attach/detach pid: 28293)
From the ps monitoring data, a server process (82595) that never exited after the shutdown was identified, and another process (73140) was even spawned at 00:20:33 when the database was already down. Normally server processes exit on their own once the database is shut down, so this was odd.
grid@rac2:/home/grid> cat /oracle/app/grid/.../oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:14:38/,/00:28:24/p'|egrep "^oracle" |grep oracleorcl2
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 73140 73132 19 0.0 0.0 458072 14960 pipe_wait s 00:20:33 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 73140 73132 19 0.0 0.0 458072 14960 pipe_wait s 00:20:33 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 73140 73132 19 0.0 0.0 458072 14960 pipe_wait s 00:20:33 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 73140 73132 19 0.0 0.0 458072 14960 pipe_wait s 00:20:33 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
Trace process 82595 up its parent chain (a scripted version of this walk is sketched after the manual trace):
grid@rac2:/home/grid> cat /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:15:23/,/00:15:38/p'|egrep "^oracle" |grep oracleorcl2
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
grid@rac2:/home/grid> cat /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:15:23/,/00:15:38/p'|egrep 82535
oracle 82595 82535 19 0.0 0.0 315033232 15416 pipe_wait s 00:13:37 00:00:00 oracleorcl2 (description=(local=yes)(address=(protocol=beq)))
oracle 82535 72407 19 0.0 0.0 114040 13312 n_tty_read s 00:13:37 00:00:00 sqlplus as sysdba
grid@rac2:/home/grid> cat /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:15:23/,/00:15:38/p'|egrep 72407
oracle 82535 72407 19 0.0 0.0 114040 13312 n_tty_read s 00:13:37 00:00:00 sqlplus as sysdba
oracle 72407 72281 19 0.0 0.0 116788 3320 do_wait s 00:13:32 00:00:00 -bash
grid@rac2:/home/grid> cat /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:15:23/,/00:15:38/p'|egrep 72281
oracle 72407 72281 19 0.0 0.0 116788 3320 do_wait s 00:13:32 00:00:00 -bash
oracle 72281 69874 19 0.0 0.0 163376 2420 poll_schedule_timeout s 00:13:32 00:00:00 sshd: oracle@pts/2
grid@rac2:/home/grid> cat /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:15:23/,/00:15:38/p'|egrep 69874
root 69874 58356 19 0.0 0.0 163376 6036 poll_schedule_timeout s 00:13:31 00:00:00 sshd: oracle [priv]
oracle 72281 69874 19 0.0 0.0 163376 2420 poll_schedule_timeout s 00:13:32 00:00:00 sshd: oracle@pts/2
grid@rac2:/home/grid> cat /oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat |sed -n '/00:15:23/,/00:15:38/p'|egrep 58356
root 122271 58356 19 0.0 0.0 163376 6032 poll_schedule_timeout s 00:04:16 00:00:00 sshd: oracle [priv]
root 108972 58356 19 0.0 0.0 165196 5928 poll_schedule_timeout s 09:10:36 00:00:00 sshd: oracle [priv]
root 69874 58356 19 0.0 0.0 163376 6036 poll_schedule_timeout s 00:13:31 00:00:00 sshd: oracle [priv]
root 58356 1 19 0.0 0.0 112756 4352 poll_schedule_timeout s aug 19 00:04:31 /usr/sbin/sshd -d
root 47981 58356 19 0.0 0.0 167728 6424 poll_schedule_timeout s nov 06 00:00:00 sshd: hgaqjc [priv]
root 47614 58356 19 0.0 0.0 167716 6424 poll_schedule_timeout s nov 04 00:00:00 sshd: hgaqjc [priv]
root 31839 58356 19 0.0 0.0 163388 6028 poll_schedule_timeout s jan 05 00:00:00 sshd: oracle [priv]
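The same parent-chain walk can be scripted instead of repeating the grep by hand. A minimal sketch against the same oswbb ps snapshot (file path, snapshot window and starting PID are the ones used above; in this ps output column 2 is the PID and column 3 the PPID):
f=/oracle/app/grid/oracle.ahf/data/repository/suptools/rac2/oswbb/grid/archive/oswps/rac2_ps_22.01.11.0000.dat
pid=82595
while [ -n "$pid" ] && [ "$pid" != "1" ]
do
  # find the process line for $pid inside one snapshot window, then follow its parent
  line=$(sed -n '/00:15:23/,/00:15:38/p' "$f" | awk -v p="$pid" '$2==p {print; exit}')
  [ -z "$line" ] && break
  echo "$line"
  pid=$(echo "$line" | awk '{print $3}')
done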
The patch log also shows the parent (82535) of this process (82595): during patch installation it was reported as still holding the library file libclntsh.so.19.1, and a colleague killed it before installing the patch.
more /oracle/app/oracle/product/19c/db_1/cfgtoollogs/opatch/opatch2022-01-11_00-26-54am_1.log
Because a process was still attached to the shared memory, the shared memory segment could not be reclaimed and its huge pages stayed allocated; as a result, when the instance was restarted there were not enough free huge pages, and the SGA was allocated from 4k pages instead.
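From the OS side, the lingering attachment could also have been confirmed before restarting the instance. A minimal sketch (ipcs -mp lists the creator PID and the last attach/detach PID for each segment; 18969 and 28293 are the PIDs from the alert-log warning above):
# creator pid (cpid) and last attach/detach pid (lpid) of every shared memory segment
ipcs -mp
# check whether those processes still exist and what they are
ps -fp 18969,28293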
- One open question remains: why process 82595 did not exit normally (reply from the Oracle SR).
At the time, this server process was inside the OS kernel function pipe_wait; it looks like it was executing an OS system call and got stuck in an abnormal state.
To analyze this further, the complete kernel-level call stack of the process would have to be captured while the problem is occurring:
Run as root:
# cat /proc/<pid>/stack    <== replace <pid> with the PID of the problem process
Please collect /proc/<pid>/stack of the non-exiting process the next time a similar problem occurs.
BTW: we checked, and many processes that hang in pipe_wait are doing OS VFS I/O operations.
Summary
- Timeline of the incident, based on the HugePages changes
1. Before the instance was shut down for patching (shutdown at 00:13:23), huge page usage was normal.
zzz ***tue jan 11 00:12:38 cst 2022
hugepages_total: 192988 hugepages_free: 39808 hugepages_rsvd: 421
2. After the shutdown and before the restart (startup at 00:28:34), the shared memory was not released, so the huge pages were not released either.
zzz ***tue jan 11 00:22:09 cst 2022
hugepages_total: 192988 hugepages_free: 39808 hugepages_rsvd: 421
3. After the patched instance started (00:28:34), the remaining free huge pages were being consumed, but they were not enough to hold the SGA, so the SGA could not fully use HugePages and mostly used 4k pages.
zzz ***tue jan 11 00:28:39 cst 2022
hugepages_total: 192988 hugepages_free: 38026 hugepages_rsvd: 37822
zzz ***tue jan 11 00:28:54 cst 2022
hugepages_total: 192988 hugepages_free: 16400 hugepages_rsvd: 16196
zzz ***tue jan 11 00:29:09 cst 2022
hugepages_total: 192988 hugepages_free: 869 hugepages_rsvd: 665
4. Around 01:13:58 the huge pages were finally released, but by then Oracle could no longer use them (the SGA had already been allocated with 4k pages).
zzz ***tue jan 11 01:13:58 cst 2022
hugepages_total: 192988 hugepages_free: 154049 hugepages_rsvd: 244
Process 82595 also disappeared at this point (output of the counting script above, grepping for 82595 instead):
tue_jan_11_01:13:43_cst_2022 1
tue_jan_11_01:13:58_cst_2022 0
5. With a high number of database connections, the page tables consumed too much memory (and such large page tables also hurt performance); eventually memory ran out, I/O could not complete, and the node was evicted.
zzz ***tue jan 11 05:16:06 cst 2022
hugepages_total: 192988 hugepages_free: 192988 hugepages_rsvd: 0
6. When the database was started again at 2022-01-11 05:17:22, almost the whole SGA was allocated from 2 MB pages (HugePages) and everything returned to normal.
zzz ***tue jan 11 05:19:37 cst 2022
hugepages_total: 192988 hugepages_free: 39810 hugepages_rsvd: 423
2022-01-11t05:17:22.778865 08:00
available system pagesizes:
4k, 2048k
2022-01-11t05:17:22.778981 08:00
supported system pagesize(s):
2022-01-11t05:17:22.779041 08:00
pagesize available_pages expected_pages allocated_pages error(s)
2022-01-11t05:17:22.779104 08:00
4k configured 14 14 none
2022-01-11t05:17:22.779226 08:00
2048k 192988 153601 153601 none
- Summary of the database stop/start procedure
1. Confirm the host
ip a
uname -a
2. Confirm the database and spfile
$ export oracle_sid=
$ export oracle_home=
$ sqlplus / as sysdba
sql> show parameter name
sql> show parameter spfile
3. Comment out crontab entries; also check whether the root user has any database-related crontab entries
$ crontab -l > /home/oracle/crontab_bak_xxxx
$ crontab -e    -- clear the entries
4. Check active sessions and deal with anything like a running backup job in advance
sql> @ase.sql
5. Stop the listener
su - grid
srvctl stop listener -node rac2
6. Kill the remaining user sessions
su - oracle
ps -ef|grep "local=no"|grep $oracle_sid |grep -v grep |wc -l
ps -ef|grep "local=no"|grep $oracle_sid |grep -v grep |awk '{print "kill -9 " $2}'|sh
7. Check for uncommitted transactions
select
s.username,
t.xidusn,
t.xidslot,
t.xidsqn,
x.ktuxesiz
from
sys.x$ktuxe x,
sys.v_$transaction t,
sys.v_$session s
where
x.inst_id = userenv('instance') and
x.ktuxesta = 'ACTIVE' and
x.ktuxesiz >= 1 and
t.xidusn = x.ktuxeusn and
t.xidslot = x.ktuxeslt and
t.xidsqn = x.ktuxesqn and
s.saddr = t.ses_addr;
8. Check sessions, then switch logs and run a checkpoint
select inst_id, sessions_current,sessions_highwater from gv$license;
@kill type='user'
alter system switch logfile;
alter system archive log current;
alter system checkpoint;
9. Shut down the database and watch the alert log for any abnormal output
shutdown immediate
If srvctl is used to stop the database, remember to add the stop option; the default is abort:
srvctl stop instance -db orcl -instance orcl1 -stopoption immediate
srvctl stop database -db orcl -stopoption immediate
10. Confirm the shared memory segments have been released and check for leftover processes (a huge-page availability check is sketched after this step)
cat /proc/meminfo    -- check that the huge pages have been released
ipcs -m
When a host runs several instances, ipcs alone cannot tell which segment belongs to which instance; add a step using Oracle's sysresv utility:
$oracle_home/bin/sysresv
If the shared memory is still not released automatically 1-2 minutes after the instance has shut down, it can be removed manually with ipcrm:
ipcrm -m shmid    -- remove the shared memory segment
Check for leftover processes:
ps -ef|grep local |grep $oracle_sid |grep -v grep
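Before step 11, it is also worth confirming that enough huge pages are free to hold the SGA, to avoid the fallback to 4k pages seen in this incident. A minimal sketch (307200 MB is this system's sga_max_size; adjust for the instance being started):
sga_mb=307200                                    # sga_max_size
need=$(( sga_mb / 2 ))                           # number of 2 MB huge pages needed for the SGA
free=$(awk '/HugePages_Free/ {print $2}' /proc/meminfo)
echo "need=${need} free=${free}"
if [ "$free" -lt "$need" ]; then
  echo "WARNING: not enough free huge pages, the SGA would fall back to 4k pages"
fi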
11. Start the database while tailing the alert log
sql> startup
Or with srvctl:
srvctl start instance -db orcl -instance orcl1
srvctl start database -db orcl
After startup, check the page-size section of the alert log; it should show the SGA allocated almost entirely from 2048k pages, for example:
2022-01-10t23:54:29.121877 08:00
available system pagesizes:
4k, 2048k
2022-01-10t23:54:29.121989 08:00
supported system pagesize(s):
2022-01-10t23:54:29.122047 08:00
pagesize available_pages expected_pages allocated_pages error(s)
2022-01-10t23:54:29.122105 08:00
4k configured 14 14 none
2022-01-10t23:54:29.122474 08:00
2048k 192475 153601 153601 none
12. Post-startup checks
Check that a log switch works:
alter system switch logfile;
Check the listener:
lsnrctl status
Check OGG and Data Guard.