【记一次mysql主从复制延迟问题分析】 -m6米乐安卓版下载

原创张sir 2022-10-19

1241

mysql数据库的主从延迟问题在生产上还是很经常出现的，从mysql5.7以后开始出现基于write set的并行复制，主从延迟的问题得到的了大大的缓解，但是由于开发在使用mysql的过程中，并不了解主从复制的原理，也不会遵循mysql一些最佳实践，导致主从延迟的问题依然大量存在。本文是生产上处理的一次主从延迟问题。文末总结给出了解决主从延迟的几个关键点。

对于主从复制延迟的问题，我的分析思路是这样的：

通过show slave status确认，延迟binlog的点，主要查看 relay_log_file、relay_log_pos，重点关注relay_log_pos是否变化。
如果relay_log_pos一直不变，说明可能是遇到了大事务。可以解析相关的binlog、relaylog看看对哪个表在操作。
从binlog/relaylog中找到表名，可以确定这个操作是delete还是update还是insert。
再去相应的主库的慢日志中，根据对应的时间点，找是否有相关慢查询。此时可以确定该sql的具体内容，看看是否需要优化sql。
如果第2步查到的relay_log_pos一直在变，说明备库一直在回放事务，那么可以看看操作系统的io情况是不是给力，是否卡其了并行。

1、登录备库查看数据库的复制情况，主从延迟已经达到了2269592s，且relay_log_pos保持不变，应该是遇到大事务了。

                connect_retry: 60
              master_log_file: mysql-bin.002211
          read_master_log_pos: 51704210
               relay_log_file: mysqld-relay-bin.006395
                relay_log_pos: 13999474
        relay_master_log_file: mysql-bin.002134                
             slave_io_running: yes
            slave_sql_running: yes
              replicate_do_db: 
	      master_ssl_cert: 
            master_ssl_cipher: 
               master_ssl_key: 
        seconds_behind_master: 2269592    
master_ssl_verify_server_cert: no
                last_io_errno: 0
                last_io_error: 
               last_sql_errno: 0
               last_sql_error:

2、利用mysqlbinlog解析当前的relaylog找到当前操作的表，确认为对表req_log的delete操作，时间为210422 2:18:07s发起的事务。


# at 13999474
#210422  2:18:07 server id 2  end_log_pos 13999395 crc32 0x755796e1     query   thread_id=107738405     exec_time=0     error_code=0
set timestamp=1619029087/*!*/;
set @@session.pseudo_thread_id=107738405/*!*/;
set @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
set @@session.sql_mode=1075838976/*!*/;
set @@session.auto_increment_increment=2, @@session.auto_increment_offset=2/*!*/;
/*!\c utf8 *//*!*/;
set @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
set @@session.time_zone='system'/*!*/;
set @@session.lc_time_names=0/*!*/;
set @@session.collation_database=default/*!*/;
begin
/*!*/;
# at 13999559
#210422  2:18:07 server id 2  end_log_pos 13999481 crc32 0x651ea915     table_map: `abc`.`req_log` mapped to number 286931
# at 13999645
#210422  2:18:07 server id 2  end_log_pos 14007611 crc32 0xcd1abf13     delete_rows: table id 286931
# at 14007775
#210422  2:18:07 server id 2  end_log_pos 14015737 crc32 0x408739cf     delete_rows: table id 286931
# at 14015901
#210422  2:18:07 server id 2  end_log_pos 14023927 crc32 0x855917e7     delete_rows: table id 286931
# at 14024091
#210422  2:18:07 server id 2  end_log_pos 14032103 crc32 0x9adfe391     delete_rows: table id 286931

3、根据表名和对应事务的时间点去主库慢日志，看能否匹配到相关的sql。找到了一个delete语句。在主库执行了22s。

# time: 210422  2:18:30
# user@host: abc[abc] @  [21.123.119.129]  id: 107738405
# query_time: 22.972652  lock_time: 0.000143 rows_sent: 0  rows_examined: 6241045 logical_reads: 0 physical_reads: 0
set timestamp=1619029110;
delete from req_log
where date(sys_date_time)<=date_sub(curdate(),interval 3 day);

4、看下该delete语句的执行计划是全表扫描。

explain select * from req_log
where date(sys_date_time)<=date_sub(curdate(),interval 3 day);
 ---- ------------- --------- ------ --------------- ------ --------- ------ --------- ------------- 
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows    | extra       |
 ---- ------------- --------- ------ --------------- ------ --------- ------ --------- ------------- 
|  1 | simple      | req_log | all  | null          | null | null    | null | 4231103 | using where |
 ---- ------------- --------- ------ --------------- ------ --------- ------ --------- -------------

5、看下表的数据量，从统计信息上看，大概有4百万。

mysql> select table_name,table_rows,data_length/1024/1024/1024 from tables where table_name in ('req_log');
 ------------ ------------ ---------------------------- 
| table_name | table_rows | data_length/1024/1024/1024 |
 ------------ ------------ ---------------------------- 
| req_log    |    4231227 |             0.876449584961 |
 ------------ ------------ ---------------------------- 
``
6、看下这个表是否有索引。没有任何索引。
```language
mysql> show create table req_log\g
*************************** 1. row ***************************
       table: req_log
create table: create table `req_log` (
  `id` varchar(40) default null,
  `app_id` varchar(8) default null,
  `res_json_param` text comment '响应的json参数',
  `req_type` varchar(10) default null comment '请求类型',
  `req_json_param` text comment '请求的json参数',
  `sys_date_time` datetime default null comment '系统时间',
  `attachments_path` varchar(100) default null comment '附件路径',
  `result_type` varchar(10) default null comment '结果类型',
  `api_name` varchar(255) default null comment '调用接口名称',
  `ask_resp_time` varchar(255) default null,
  `tts_resp_time` varchar(255) default null,
  `rtp_resp_time` varchar(255) default null,
  `srp_resp_time` varchar(255) default null
) engine=innodb default charset=utf8
/*!50100 partition by range (month(sys_date_time))
(partition p1 values less than (2) engine = innodb,
 partition p2 values less than (3) engine = innodb,
 partition p3 values less than (4) engine = innodb,
 partition p4 values less than (5) engine = innodb,
 partition p5 values less than (6) engine = innodb,
 partition p6 values less than (7) engine = innodb,
 partition p7 values less than (8) engine = innodb,
 partition p8 values less than (9) engine = innodb,
 partition p9 values less than (10) engine = innodb,
 partition p10 values less than (11) engine = innodb,
 partition p11 values less than (12) engine = innodb,
 partition p12 values less than (13) engine = innodb) */
1 row in set (0.00 sec)

7、经过以上分析，发下出现问题的是一个delete语句，而且被清理的表上没有索引，也没有主键。那为什么我们要关注索引和主键呢，这里讲一下，mysql的binlog在row模式下的记录的格式，对于一个delete或者update语句：delete from big_table where id<10;binlog里实际上记录的是如下的形式，也就是说如果主库清理掉10条记录，日志就会记录10条delete，而且这个delete后面的where条件带上了所有的字段值。如果表没有索引或者主键，在主库执行一次的全表扫描，备库就要执行10次的全表扫描。这对于批量的清理或者更新就很恐怖了，在主库执行1w次的清理，你可能感觉不太慢，那么备库在重放的时候可能麻烦了。

### delete from `test`.`big_table`
### where
###   @1=10 /* int meta=0 nullable=0 is_null=0 */
###   @2=3 /* int meta=0 nullable=1 is_null=0 */
###   @3='d3d9446802a44259755d38e6d163e820' /* varstring(128) meta=128 nullable=1 is_null=0 */
###   @4=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @5=null /* int meta=0 nullable=1 is_null=1 */
###   @6=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @7=null /* int meta=0 nullable=1 is_null=1 */
###   @8=null /* int meta=0 nullable=1 is_null=1 */
###   @9=null /* int meta=0 nullable=1 is_null=1 */
### delete from `test`.`big_table`
### where
###   @1=11 /* int meta=0 nullable=0 is_null=0 */
###   @2=4 /* int meta=0 nullable=1 is_null=0 */
###   @3='6512bd43d9caa6e02c990b0a82652dca' /* varstring(128) meta=128 nullable=1 is_null=0 */
###   @4=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @5=null /* int meta=0 nullable=1 is_null=1 */
###   @6=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @7=null /* int meta=0 nullable=1 is_null=1 */
###   @8=null /* int meta=0 nullable=1 is_null=1 */
###   @9=null /* int meta=0 nullable=1 is_null=1 */
### delete from `test`.`big_table`
### where
###   @1=13 /* int meta=0 nullable=0 is_null=0 */
###   @2=6 /* int meta=0 nullable=1 is_null=0 */
###   @3='c51ce410c124a10e0db5e4b97fc2af39' /* varstring(128) meta=128 nullable=1 is_null=0 */
###   @4=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @5=null /* int meta=0 nullable=1 is_null=1 */
###   @6=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @7=null /* int meta=0 nullable=1 is_null=1 */
###   @8=null /* int meta=0 nullable=1 is_null=1 */
###   @9=null /* int meta=0 nullable=1 is_null=1 */
### delete from `test`.`big_table`
### where
###   @1=14 /* int meta=0 nullable=0 is_null=0 */
###   @2=0 /* int meta=0 nullable=1 is_null=0 */
###   @3='aab3238922bcc25a6f606eb525ffdc56' /* varstring(128) meta=128 nullable=1 is_null=0 */
###   @4=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @5=null /* int meta=0 nullable=1 is_null=1 */
###   @6=null /* varstring(40) meta=40 nullable=1 is_null=1 */
###   @7=null /* int meta=0 nullable=1 is_null=1 */
###   @8=null /* int meta=0 nullable=1 is_null=1 */
###   @9=null /* int meta=0 nullable=1 is_null=1 */

8、基于以上的分析，对于这种情况，我们建议表上都加上主键，没有业务字段可以做主键，可以添加自增列。再不济就加个索引，也比啥也没有强。我们的这个问题由于延迟实在是太久了，即使现在加上了主键或者索引，短期内也无法同步到备库。我们建议先在表上加索引或主键，然后再重做备机，这样备库可以马上就用上索引。

1、对于mysql种使用innodb的表来说，一定要加上主键，这也是很多大公司的硬性要求。
2、对于批量操作，建议分批提交，减少事务量。
3、无法添加主键的，建议添加选择性好的唯一索引或者普通索引。
4、如果以上均暂时无法满足，可以尝试调整参数slave_rows_search_algorithms，这个参数指定了备库在扫描表的是时候的算法，默认是table_scan,index_scan，可以调整成hash_scan,table_scan。
5、如果对于大量的insert操作导致的备库延迟，可以尝试临时调整innodb_flush_log_at_trx_commit和sync_binlog设置成非1，提高io的利用率。
6、以上操作均在生产环境做过实施，安全性还是可以保证的。
7、文章https://www.modb.pro/db/47411?utm_source=index_ai分析了一些源码级别的解释，大家可以看看。

墨力计划故障处理

最后修改时间：2023-03-22 13:48:51

「喜欢这篇文章，您的关注和赞赏是给作者最好的鼓励」

关注作者

【记一次mysql主从复制延迟问题分析】 -m6米乐安卓版下载

评论