Handling a startup failure of a MogDB primary/standby cluster after an abnormal power loss

Original article by 黄超, 2022-09-13

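This happened in our department's test environment: a MogDB cluster with one primary and one standby, version 2.1.1. The cluster was healthy before both machines lost power. After the restart the operating system booted normally, but starting the MogDB cluster failed.
Primary IP: 192.168.137.110
Standby IP: 192.168.137.111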

$ gs_om -t start
starting cluster.
=========================================
=========================================
[gauss-53600]: can not start the database, the cmd is . /home/omm/.bashrc; python3 '/dbdata/app/tools/script/local/startinstance.py' -u omm -r /dbdata/app/mogdb -t 300 --security-mode=off,  error:
[failure] master:
[gauss-51607] : failed to start instance. error: please check the gs_ctl log for failure details.
[2022-09-08 17:47:43.905][1718][][gs_ctl]: gs_ctl started,datadir is /dbdata/data/db1 
[2022-09-08 17:47:51.180][1718][][gs_ctl]: waiting for server to start...
.0 log:  [alarm module]can not read gauss_warning_type env.
	
0 log:  [alarm module]host name: master 
	
0 log:  [alarm module]host ip: 192.168.137.110 
	
0 log:  [alarm module]cluster name: dbcluster 
	
..2022-09-08 17:47:53.433 6319ba47.1 [unknown] 140561613260352 [unknown] 0 dn_6001_6002 db010  0 [redo] log:  recovery parallelism, cpu count = 2, max = 4, actual = 2
2022-09-08 17:47:53.433 6319ba47.1 [unknown] 140561613260352 [unknown] 0 dn_6001_6002 db010  0 [redo] log:  configrecoveryparallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
failed to read gaussdb.state: 0failed to set gaussdb.state with unknown_state[2022-09-08 17:47:54.185][1718][][gs_ctl]: waitpid 1722 failed, exitstatus is 256, ret is 2
[2022-09-08 17:47:54.185][1718][][gs_ctl]: stopped waiting
[2022-09-08 17:47:54.185][1718][][gs_ctl]: could not start server
examine the log output.[failure] standby:
[gauss-51607] : failed to start instance. error: please check the gs_ctl log for failure details.
[2022-09-08 17:48:00.805][1344][][gs_ctl]: gs_ctl started,datadir is /dbdata/data/db1 
[2022-09-08 17:48:02.935][1344][][gs_ctl]: waiting for server to start...
.0 log:  [alarm module]can not read gauss_warning_type env.
	
0 log:  [alarm module]host name: standby 
	
0 log:  [alarm module]host ip: 192.168.137.111 
	
0 log:  [alarm module]cluster name: dbcluster 
	
2022-09-08 17:48:03.632 6319ba53.1 [unknown] 139726745114176 [unknown] 0 dn_6001_6002 db010  0 [redo] log:  recovery parallelism, cpu count = 2, max = 4, actual = 2
2022-09-08 17:48:03.632 6319ba53.1 [unknown] 139726745114176 [unknown] 0 dn_6001_6002 db010  0 [redo] log:  configrecoveryparallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
failed to read gaussdb.state: 0failed to set gaussdb.state with unknown_state[2022-09-08 17:48:03.937][1344][][gs_ctl]: waitpid 1347 failed, exitstatus is 256, ret is 2
[2022-09-08 17:48:03.937][1344][][gs_ctl]: stopped waiting
[2022-09-08 17:48:03.937][1344][][gs_ctl]: could not start server
examine the log output.

1. Check the cluster status

  #su - omm
$ gs_om -t status --detail
[   cluster state   ]
cluster_state   : unavailable
redistributing  : no
current_az      : az_all
[  datanode state   ]
    node   node_ip         port      instance                 state
-----------------------------------------------------------------------------------
1  master  192.168.137.110 26000      6001 /dbdata/data/db1   p down    manually stopped
2  standby 192.168.137.111 26000      6002 /dbdata/data/db1   s down    manually stopped

2. Check the database version

$ gs_ctl -V
gs_ctl (opengauss) 9.2.4
$ mogdb -V
gaussdb (mogdb 2.1.1 build b5f25b20) compiled at 2022-03-21 14:42:30 commit 0 last mr 

3. Check the logs

Find the log directory:

cat /dbdata/data/db1/postgresql.conf |grep -i log_dir
log_directory = '/dbdata/log/omm/pg_log/dn_6001'		# directory where log files are written,

List the log files:

  cd /dbdata/log/omm/pg_log/dn_6001
ls -l
-rw-------  1 omm dbgrp 100076 sep  8 15:47 postgresql-2022-09-08_144356.log
-rw-------  1 omm dbgrp      0 sep  8 17:04 postgresql-2022-09-08_170442.log

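The newest log file is 0 bytes: nothing is being written to the server log any more, so it gives no clue about the failure.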

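4. Check the official manual

Look up the error codes [gauss-53600] and [gauss-51607] in the official manual:

gauss-53600: "ca password must contain at least eight characters."
sqlstate: none
Cause: internal system error.
Solution: contact technical support.
gauss-51607: "failed to start %s."
Cause: failed to start the cluster, node, or instance.
Solution: 1. check that the network connection is working; 2. check that the configuration files are correct.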
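
When the start-up output is long, the error codes can be pulled out mechanically before looking them up. A minimal Python sketch, assuming the codes always appear in square brackets in the form shown in the output above; the function name `extract_error_codes` is hypothetical and not part of any MogDB tool:

```python
import re
from typing import List

# Matches bracketed codes such as "[gauss-53600]" as they appear in the output above.
# Case-insensitive, because the capitalisation in real output may differ (assumption).
_CODE_RE = re.compile(r"\[(gauss-\d{5})\]", re.IGNORECASE)


def extract_error_codes(output: str) -> List[str]:
    """Return the distinct error codes in order of first appearance, lower-cased."""
    seen: List[str] = []
    for match in _CODE_RE.finditer(output):
        code = match.group(1).lower()
        if code not in seen:
            seen.append(code)
    return seen


if __name__ == "__main__":
    sample = (
        "[gauss-53600]: can not start the database\n"
        "[failure] master:\n"
        "[gauss-51607] : failed to start instance.\n"
    )
    print(extract_error_codes(sample))
```

Tags such as `[failure]` are ignored because they do not match the code pattern.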

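5. Check the source code

The error output mentions the file gaussdb.state, but searching the official manual for gaussdb.state turned up no topic on it.
Searching the official source code for the message "failed to read gaussdb.state" leads to the following code: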
postmaster.cpp

/*
 * only update gaussdb.state file's state field.
 *
 * parameters:
 *      state: input new state
 * return:
 *      true if success, otherwise false.
 *
 * note: unsafe function is not expected here since it is referred in signal handler.
 */
bool SetDBStateFileState(DbState state, bool optional)
{
    /* do nothing while core dump be appeared so early. */
    if (strlen(gaussdb_state_file) > 0) {
        char temppath[MAXPGPATH] = {0};
        GaussState s;
        int len = 0;
        /* zero it in case gaussdb.state doesn't exist. */
        int rc = memset_s(&s, sizeof(GaussState), 0, sizeof(GaussState));
        securec_check_c(rc, "\0", "\0");
        rc = snprintf_s(temppath, MAXPGPATH, MAXPGPATH - 1, "%s.temp", gaussdb_state_file);
        securec_check_intval(rc, , false);
        /* write the new content into a temp file and rename it at last. */
        int fd = open(gaussdb_state_file, O_RDONLY);
        if (fd == -1) {
            if (errno == ENOENT && optional) {
                write_stderr("gaussdb.state does not exist, and skipt setting since it is optional.");
                return true;
            } else {
                write_stderr("failed to open gaussdb.state.temp: %d", errno);
                return false;
            }
        }
        /* read old content from file. */
        len = read(fd, &s, sizeof(GaussState));
        /* sizeof(int) is for current_connect_idx of GaussState */
        if ((len != sizeof(GaussState)) && (len != sizeof(GaussState) - sizeof(int))) {
            write_stderr("failed to read gaussdb.state: %d", errno);
            (void)close(fd);
            return false;
        }
        /* ... rest of the function omitted in this excerpt ... */

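The message comes from the function SetDBStateFileState in postmaster.cpp. When MogDB starts, it reads gaussdb.state in order to set the database run state. In this case the number of bytes read from gaussdb.state failed the length comparison, so the function printed the error, returned false, and the start was aborted.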
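
The same length comparison can be run outside the server to test a suspect file before attempting a start. The Python sketch below mirrors the check in the excerpt. It assumes sizeof(int) is 4 and takes the structure size as a parameter, because the real sizeof(GaussState) depends on the build; the default of 72 is only the size of the healthy file seen in this article, and the function name is hypothetical.

```python
INT_SIZE = 4  # assumption: sizeof(int) on the target platform


def state_file_length_ok(path: str, struct_size: int = 72) -> bool:
    """Mirror the length check from the excerpt.

    struct_size stands in for sizeof(GaussState). The default of 72 is the
    size of the healthy file observed in this article, not a value taken
    from the source code.
    """
    with open(path, "rb") as handle:
        # Like read(fd, &s, sizeof(GaussState)): ask for at most struct_size bytes.
        length = len(handle.read(struct_size))
    return length in (struct_size, struct_size - INT_SIZE)
```

A zero-length file, or one holding only a few bytes, yields False, which corresponds to the branch that prints "failed to read gaussdb.state".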

1. Inspect gaussdb.state

cd /dbdata/data/db1/
ll gaussdb.state
-rw-------  1 omm dbgrp       0 sep  8 17:04 gaussdb.state

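The permissions and ownership are normal, but the file size of 0 bytes is not. Reading an empty file returns 0 bytes, which fails the length comparison in the code above. The read itself does not fail, which is consistent with the errno value 0 at the end of the error message.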

  cat gaussdb.state

The output is empty.

  vi gaussdb.state

The file is empty.
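
A sudden power loss can truncate more than one file, so it may be worth listing every zero-length regular file at the top level of the data directory. A hedged Python sketch; the path is the one used in this article, the helper name is made up for illustration, and an empty file is not necessarily a damaged one:

```python
import os
from typing import List


def find_empty_files(directory: str) -> List[str]:
    """Return the sorted names of zero-length regular files directly inside directory."""
    empty = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # follow_symlinks=False: judge the entry itself, not a link target
            if entry.is_file(follow_symlinks=False) and entry.stat(follow_symlinks=False).st_size == 0:
                empty.append(entry.name)
    return sorted(empty)


if __name__ == "__main__":
    data_dir = "/dbdata/data/db1"  # data directory from this article
    if os.path.isdir(data_dir):
        for name in find_empty_files(data_dir):
            print(name)
```

The scan is deliberately not recursive; each hit still needs to be judged by hand.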

2. Replace gaussdb.state

  rm -f gaussdb.state

Copy a gaussdb.state file from another, healthy MogDB environment to both the primary and the standby:

  cp gaussdb.state
ll gaussdb.state
-rw-r--r-- 1 root root      72 sep  9 15:14 gaussdb.state
chown omm.dbgrp gaussdb.state
ll gaussdb.state
-rw-r--r-- 1 omm dbgrp      72 sep  9 15:24 gaussdb.state

Inspect the healthy gaussdb.state:

  cat gaussdb.state

Nothing readable is printed, because the file is binary.

  vi gaussdb.state
^b^@^@^@^a^@^@^@^a^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
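
Because gaussdb.state is binary, dumping it as integers is easier to read than the control characters shown by vi. The sketch below prints a byte string as little-endian 32-bit values. It assumes a little-endian layout and that ^b, ^a and ^@ in the vi output stand for the bytes 0x02, 0x01 and 0x00; it makes no claim about which structure field each value belongs to.

```python
import struct
from typing import List


def dump_int32_le(data: bytes) -> List[int]:
    """Interpret data as consecutive little-endian signed 32-bit integers.

    Trailing bytes that do not fill a whole 4-byte word are ignored.
    """
    usable = len(data) - (len(data) % 4)
    return [value for (value,) in struct.iter_unpack("<i", data[:usable])]


if __name__ == "__main__":
    # Transcribed from the vi output above: 02 00 00 00, 01 00 00 00,
    # 01 00 00 00, then zero bytes up to the 72-byte file size.
    healthy = bytes([2, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]) + bytes(60)
    print(len(healthy), dump_int32_le(healthy)[:4])
```

To inspect a real file, pass the result of reading it in binary mode, for example `dump_int32_le(open("gaussdb.state", "rb").read())`.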

3. Start the cluster and check its status

  gs_om -t start
starting cluster.
=========================================
[success] master
[success] standby
=========================================
successfully started.
gs_om -t status --detail
[   cluster state   ]
cluster_state   : normal
redistributing  : no
current_az      : az_all
[  datanode state   ]
    node   node_ip         port      instance                 state
-----------------------------------------------------------------------------------
1  master  192.168.137.110 26000      6001 /dbdata/data/db1   p primary normal
2  standby 192.168.137.111 26000      6002 /dbdata/data/db1   s standby normal
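
For scripted checks after a restart, the cluster state can be read from the status output. A small Python sketch that parses text in the layout shown above; it assumes the `cluster_state` line keeps this `name : value` form, and it takes the text as input rather than running gs_om itself. The function name is hypothetical.

```python
import re
from typing import Optional

_STATE_RE = re.compile(r"^[ \t]*cluster_state[ \t]*:[ \t]*(\S+)", re.MULTILINE | re.IGNORECASE)


def parse_cluster_state(status_output: str) -> Optional[str]:
    """Return the lower-cased value of the cluster_state line, or None if it is absent."""
    match = _STATE_RE.search(status_output)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    sample = (
        "[   cluster state   ]\n"
        "cluster_state   : normal\n"
        "redistributing  : no\n"
        "current_az      : az_all\n"
    )
    print(parse_cluster_state(sample))
```

A wrapper script could treat any value other than `normal` as a reason to stop and investigate.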

1. Can the database start normally if gaussdb.state is deleted by mistake?

Delete the file gaussdb.state:

  rm -f gaussdb.state

Start the database:

gs_ctl -D /dbdata/data/db1/ start

The start succeeds, and a new gaussdb.state is generated.

2. Can the database start normally if the contents of gaussdb.state are modified?

vi gaussdb.state

Clear the existing contents, insert a few arbitrary digits, and save.
Start the database:

gs_ctl -D /dbdata/data/db1/ start

The failure described above is reproduced.

Remove the invalid contents, put the original byte string back, and save:

^b^@^@^@^a^@^@^@^a^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

The database starts successfully.
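
Taken together, the two tests suggest that a missing gaussdb.state is tolerated at start-up while a malformed one is not. Under that assumption, an alternative to copying the file from another environment is to move the bad file aside and let the next start recreate it. The Python sketch below does only the move and is meant to run while the instance is stopped. The function name, the backup suffix, and the accepted sizes (72 or 68 bytes, inferred from the healthy file in this article and a 4-byte int) are assumptions, not documented MogDB behaviour.

```python
import os
import time
from typing import Optional

ACCEPTED_SIZES = (72, 68)  # assumption: healthy size seen in this article, and 4 bytes less


def quarantine_bad_state_file(data_dir: str) -> Optional[str]:
    """Rename gaussdb.state out of the way if its size looks wrong.

    Returns the backup path if the file was moved, otherwise None.
    This size test is stricter than the server's own check, which only
    looks at the number of bytes it managed to read.
    """
    path = os.path.join(data_dir, "gaussdb.state")
    if not os.path.isfile(path):
        return None  # nothing to do; the file is absent
    if os.path.getsize(path) in ACCEPTED_SIZES:
        return None  # size looks plausible; leave the file alone
    backup = "{}.bad.{}".format(path, time.strftime("%Y%m%d%H%M%S"))
    os.rename(path, backup)  # keep the bad file for later analysis
    return backup
```

Renaming rather than deleting keeps the damaged file available if the root cause needs to be examined later.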

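1. Make sure the database is always started and shut down cleanly and that the power supply is reliable. A sudden power loss can easily corrupt data files and leave the database in an abnormal state.
2. When the database fails to start, analyze the cause from the error output or the error logs. Looking up the official manual and searching the official source code for keywords from the error message both help.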

MogDB official manual
openGauss source code:

Last modified: 2022-09-14 10:23:51