参数配置不当导致GreatSQL异步复制IO线程中断

张开发
2026/4/18 21:46:15 15 分钟阅读

分享文章

参数配置不当导致GreatSQL异步复制IO线程中断
参数配置不当导致GreatSQL异步复制IO线程中断1.问题发生某个客户的一套GreatSQL数据库异步复制的slave节点的IO线程无法启动报错信息如下2026-02-16T12:36:40.08662008:00 12452 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel :Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the masters binary log expiration period. The GTID set sent by the slave is 8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5, and the missing transaction is 8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718 (server errno1236) 2026-02-16T12:36:40.08666408:00 12452 [ERROR] [MY-013114] [Repl] Slave I/0 for channel: Got fatal error 1236 from master when reading data from binary log: Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the masters binary log expiration period. The GTID set sent by the slave is 8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5, and the missing transaction is 8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718, Error code: MY-013114此日志信息显示由于主库的binlog已清除IO线程中断。用户现场DBA进行了clone恢复。但问题原因究竟是什么呢为什么中间丢失了一个事务8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718呢2. 问题分析2.1 error.log解读仔细阅读error.log中的报错信息可以解读出如下信息slave在启动IO线程时向master节点发送的GTID事务集是8451fc4d-dlae-1lec-b273-525400fdb1cb:1-424857633, 8451fc4d-d1ae-11ec-b2bf-5254002b8928:1-19789717:19789719-20046740, 8653ccd7-d1ae-11ec-a24c-525400791c4c:1-5slave已经执行的GTID事务集不连续中间缺少了事务8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718。slave在缺少该事务8451fc4d-d1ae-11ec-b2bf-5254002b8928:19789718后执行了257021个事务。IO线程启动失败的原因是master节点中包含事务19789718的binlog已被清理。2.2 初步判断检查用户的binlog_expire_logs_seconds是6048007天并不短。根据GTID有空洞的现象判断与参数slave_skip_errors数相关检查用户的参数配置slave_skip_errors是ALL。参数slave_skip_errors默认OFF情况下当slave上发生错误时复制进程会停止让DBA有机会解决数据中的不一致问题。当配置slave_skip_errors配置错误类型时当slave上某条语句返回的错误属于该变量值中列出的任一错误类型复制SQL线程仍会继续执行。当slave_skip_errorsALL时slave上发生任何错误SQL线程都会继续执行。3. 实验验证3.1 实验设计搭建异步复制的环境slave端设置并行复制参数slave_skip_errorsall。在slave端制造事务阻塞让slave SQL线程执行失败。在slave端开启事务执行update语句不提交在master端执行修改同一行数据的update语句观察slave端事务的执行情况检查是否会出现GTID空洞。重启slave检查是否出现与用户相同的报错信息。3.2 实验1slave_skip_errorsALL模拟问题搭建异步复制的环境slave端设置并行复制参数slave_skip_errorsALL检查slave端的参数GreatSQL SELECT global.slave_parallel_workers,global.innodb_lock_wait_timeout,global.slave_skip_errors\G *************************** 1. row *************************** global.slave_parallel_workers: 64 global.innodb_lock_wait_timeout: 10 global.slave_skip_errors: ALL 1 row in set, 2 warnings (0.01 sec)在slave端开始事务执行update语句(test1的id是主键)不提交BEGIN; UPDATE test1 SET cdddddddddddd WHERE id2;在master端执行修改相同行的update语句该语句在slave端应用时会被阻塞。UPDATE test1 SET ceeeeeeeee WHERE id2;在master端使用sysbench执行其他事务。检查master端和slave端的gtid_execued-- master端 GreatSQL SELECT gtid_executed; ---------------------------------------------------------------------------------------- | gtid_executed | ---------------------------------------------------------------------------------------- | 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-84055 | ---------------------------------------------------------------------------------------- 1 row in set (0.00 sec) GreatSQL SELECT * FROM test1 WHERE id2; ----------------------------------------------------------------------------------- | id | k | c | pad | ----------------------------------------------------------------------------------- | 2 | 50248 | eeeeeeeee | 28733802923-10548894641-11867531929-71265603657-36546888392 | ----------------------------------------------------------------------------------- 1 row in set (0.00 sec) -- slave端 GreatSQL SELECT gtid_executed; -------------------------------------------------------------------------------------------------------------------------------------------- | gtid_executed | -------------------------------------------------------------------------------------------------------------------------------------------- | 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055, f8afd36c-12c1-11f1-a6e8-00163e892deb:1 | -------------------------------------------------------------------------------------------------------------------------------------------- 1 row in set (0.00 sec) GreatSQL SELECT * FROM test1 WHERE id2; -------------------------------------------------------------------------------------- | id | k | c | pad | -------------------------------------------------------------------------------------- | 2 | 50248 | dddddddddddd | 28733802923-10548894641-11867531929-71265603657-36546888392 | -------------------------------------------------------------------------------------- 1 row in set (0.00 sec)可见事务c12440a7-12c1-11f1-95ad-00163e379f11:40805没有在slave端执行slave端的GTID出现可空洞。检查slave端的error.log在error.log中有该事务的如下信息2026-02-27T09:26:49.53084308:00 29 [Note] [MY-010584] [Repl] Slave SQL for channel : Worker 18 failed executing transaction c12440a7-12c1-11f1-95ad-00163e379f11:43805 at master log binlog.000003, end_log_pos 266711465; Could not execute Update_rows event on table testdb.test1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the events master log binlog.000003, end_log_pos 266711465, Error_code: MY-001205可知由于锁等待超时事务c12440a7-12c1-11f1-95ad-00163e379f11:43805执行失败。检查slave的状态GreatSQL SHOW slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for source to send event Master_Host: 172.17.138.128 Master_User: greatdb Master_Port: 3306 Connect_Retry: 60 Master_Log_File: binlog.000003 Read_Master_Log_Pos: 337004154 Relay_Log_File: relaylog.000004 Relay_Log_Pos: 337003634 Relay_Master_Log_File: binlog.000003 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 337004154 Relay_Log_Space: 337003837 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1006503306 Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:1-84055 Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055, f8afd36c-12c1-11f1-a6e8-00163e892deb:1 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: Master_public_key_path: Get_master_public_key: 0 Network_Namespace: 1 row in set, 1 warning (0.00 sec)可见io和sql线程都正常数据不一致的隐患未及时发现。master端purge binlog后重启slaveslave正常。GreatSQL SHOW slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for source to send event Master_Host: 172.17.138.128 Master_User: greatdb Master_Port: 3306 Connect_Retry: 60 Master_Log_File: binlog.000004 Read_Master_Log_Pos: 237 Relay_Log_File: relaylog.000007 Relay_Log_Pos: 407 Relay_Master_Log_File: binlog.000004 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 237 Relay_Log_Space: 860 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 1006503306 Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Replica has read all relay log; waiting for more updates Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:1-84055 Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-84055, f8afd36c-12c1-11f1-a6e8-00163e892deb:1 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: Master_public_key_path: Get_master_public_key: 0 Network_Namespace: 1 row in set, 1 warning (0.00 sec)slave端reset slave后再start slaveio线程无法启动报错如下2026-02-27T11:04:19.16535508:00 31406 [ERROR] [MY-010557] [Repl] Error reading packet from server for channel : Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the masters binary log expiration period. The GTID set sent by the slave is 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1, and the missing transactions are c12440a7-12c1-11f1-95ad-00163e379f11:43805 (server_errno1236) 2026-02-27T11:04:19.16548608:00 31406 [ERROR] [MY-013114] [Repl] Slave I/O for channel : Got fatal error 1236 from master when reading data from binary log: Cannot replicate because the master purged required binary logs. Replicate the missing transactions from elsewhere, or provision a new slave from backup. Consider increasing the masters binary log expiration period. The GTID set sent by the slave is 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-43804:43806-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1, and the missing transactions are c12440a7-12c1-11f1-95ad-00163e379f11:43805, Error_code: MY-013114报错信息和用户现场的报错信息相同。3.3 实验2slave_skip_errorsOFF重复实验搭建异步复制的环境slave端设置并行复制参数slave_skip_errorsOFF检查slave端的参数GreatSQL SELECT global.slave_parallel_workers,global.innodb_lock_wait_timeout,global.slave_skip_errors\G *************************** 1. row *************************** global.slave_parallel_workers: 64 global.innodb_lock_wait_timeout: 10 global.slave_skip_errors: OFF 1 row in set, 2 warnings (0.01 sec)在slave端开始事务执行update语句不提交BEGIN; UPDATE test1 SET cfrom slave222 WHERE id2;在master端执行修改相同行的update语句该语句在slave端应用时会被阻塞。UPDATE test1 SET cfrom master1111 WHERE id2;在master端使用sysbench执行其他事务。检查slave的状态发现sql线程已停止。GreatSQL show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for source to send event Master_Host: 172.17.138.128 Master_User: greatdb Master_Port: 3306 Connect_Retry: 60 Master_Log_File: binlog.000004 Read_Master_Log_Pos: 61931905 Relay_Log_File: relaylog.000004 Relay_Log_Pos: 411 Relay_Master_Log_File: binlog.000004 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1205 Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction c12440a7-12c1-11f1-95ad-00163e379f11:109034 at master log binlog.000004, end_log_pos 43620845. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. Skip_Counter: 0 Exec_Master_Log_Pos: 43620411 Relay_Log_Space: 18312445 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1205 Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction c12440a7-12c1-11f1-95ad-00163e379f11:109034 at master log binlog.000004, end_log_pos 43620845. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any. Replicate_Ignore_Server_Ids: Master_Server_Id: 1006503306 Master_UUID: c12440a7-12c1-11f1-95ad-00163e379f11 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 260227 11:18:58 Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: c12440a7-12c1-11f1-95ad-00163e379f11:109034-119519 Executed_Gtid_Set: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-109033, f8afd36c-12c1-11f1-a6e8-00163e892deb:1 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: Master_public_key_path: Get_master_public_key: 0 Network_Namespace: 1 row in set, 1 warning (0.00 sec) GreatSQL SELECT * from performance_schema.replication_applier_status_by_worker limit 1\G *************************** 1. row *************************** CHANNEL_NAME: WORKER_ID: 1 THREAD_ID: NULL SERVICE_STATE: OFF LAST_ERROR_NUMBER: 1205 LAST_ERROR_MESSAGE: Worker 1 failed executing transaction c12440a7-12c1-11f1-95ad-00163e379f11:109034 at master log binlog.000004, end_log_pos 43620845; Lock wait timeout exceeded; try restarting transaction LAST_ERROR_TIMESTAMP: 2026-02-27 11:18:58.506075 LAST_APPLIED_TRANSACTION: LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000 LAST_APPLIED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000 LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000 LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP: 0000-00-00 00:00:00.000000 APPLYING_TRANSACTION: c12440a7-12c1-11f1-95ad-00163e379f11:109034 APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 2026-02-27 11:16:28.454244 APPLYING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 2026-02-27 11:16:28.454244 APPLYING_TRANSACTION_START_APPLY_TIMESTAMP: 2026-02-27 11:16:28.463083 LAST_APPLIED_TRANSACTION_RETRIES_COUNT: 0 LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 0 LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: LAST_APPLIED_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000 APPLYING_TRANSACTION_RETRIES_COUNT: 10 APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_NUMBER: 1205 APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_MESSAGE: Lock wait timeout exceeded; try restarting transaction APPLYING_TRANSACTION_LAST_TRANSIENT_ERROR_TIMESTAMP: 2026-02-27 11:18:43.502374 1 row in set (0.00 sec)可见由于锁等待超时事务c12440a7-12c1-11f1-95ad-00163e379f11:109034无法执行从而导致slave的sql线程异常退出可以有机会修复异常数据。slave端的事务提交或回滚后阻塞解除。启动slave事务开始正常应用。GreatSQL START slave; Query OK, 0 rows affected, 1 warning (0.48 sec) GreatSQL SHOW slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for source to send event Master_Host: 172.17.138.128 Master_User: greatdb Master_Port: 3306 Connect_Retry: 60 Master_Log_File: binlog.000004 Read_Master_Log_Pos: 61931905 Relay_Log_File: relaylog.000004 Relay_Log_Pos: 875819 Relay_Master_Log_File: binlog.000004 Slave_IO_Running: Yes Slave_SQL_Running: Yes ...... -- gtid已追齐 GreatSQL SELECT gtid_executed\G *************************** 1. row *************************** gtid_executed: 0b039ab6-38ac-350f-b814-0a612d2a2859:1-6, c12440a7-12c1-11f1-95ad-00163e379f11:1-119519, f8afd36c-12c1-11f1-a6e8-00163e892deb:1 1 row in set (0.00 sec) -- 数据已同步 GreatSQL SELECT * FROM test1 WHERE id2\G *************************** 1. row *************************** id: 2 k: 50248 c: from master1111 pad: 28733802923-10548894641-11867531929-71265603657-36546888392 1 row in set (0.00 sec)总结在生产数据库上要保持slave_skip_errors的默认值OFF避免主从数据不一致的风险。对于常见的错误也需要仔细观察从蛛丝马迹中找出问题的根本原因。

更多文章