CDC Monitoring - Problem transactions
If you see this, panic (a little and then do something):
If the job is not prod-cdc (the last part of the “Description”), there is no need to panic, but do sort it out, as stage/dev ESR and TIS will be drifting out of sync.
It helps to understand Change Data Capture (CDC). The first step will almost always be to check the docker container on the machine listed.
If the logs contain something like the following, there is a problem processing a transaction:
...
2020-06-16 11:14:39 ERROR Error on bin log position Position[BinlogPosition[mysql-bin.003382:68853804], lastHeartbeat=1592034510892]
2020-06-16 11:14:39 INFO Binlog disconnected.
2020-06-16 11:14:44 WARN Timed out waiting for heartbeat 1592306079319
2020-06-16 11:14:44 INFO Stopping 4 tasks
...
The error line contains: BinlogPosition[mysql-bin.003382:68853804]
or in a generic format BinlogPosition[{binlog_file}:{binlog_position}]
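If you would rather not pick the values out by eye, the sketch below pulls the file name and position out of the log text. It is a minimal illustration, not part of the CDC tooling: the function name find_problem_positions is made up here, and it assumes the error lines always carry the word ERROR and the BinlogPosition[{binlog_file}:{binlog_position}] form shown above.

```python
import re

# Matches the generic form BinlogPosition[{binlog_file}:{binlog_position}].
BINLOG_POSITION = re.compile(r"BinlogPosition\[([^:\[\]]+):(\d+)\]")


def find_problem_positions(log_text):
    """Return (binlog_file, binlog_position) for each ERROR line that names a position.

    Hypothetical helper: assumes the log format shown in the excerpt above.
    """
    positions = []
    for line in log_text.splitlines():
        if " ERROR " not in line:
            continue
        match = BINLOG_POSITION.search(line)
        if match:
            positions.append((match.group(1), int(match.group(2))))
    return positions


# Two lines copied from the log excerpt above.
sample_log = """\
2020-06-16 11:14:39 ERROR Error on bin log position Position[BinlogPosition[mysql-bin.003382:68853804], lastHeartbeat=1592034510892]
2020-06-16 11:14:39 INFO Binlog disconnected.
"""

print(find_problem_positions(sample_log))
# prints [('mysql-bin.003382', 68853804)]
```

The first element is the file to decode and the second is the position to search for.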
To find (and skip over) a problem statement:
1. Decode the binlog file named in the error:
sudo mysqlbinlog --base64-output=DECODE-ROWS -vv /var/log/mysql/{mysql-bin_file_from_error} > /tmp/someTempFile.sql
2. Search the decoded file for the position number from the error. From that point in the file you can find the next position by searching forward for end_log_pos. You may need to search past a few to find a suitable point, such as the end of the transaction.
3. Update the binlog_position field in the maxwell.positions MySQL table to move the start point for CDC. I initially tried the first end_log_pos after the error, but this was also a problem. It may be that anything in the same transaction will also be an issue.
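The search for a suitable end_log_pos can be sketched in code. This is a hedged illustration only: candidate_positions is a hypothetical helper, the sample text is invented (it is not taken from a real binlog), and the code assumes the usual mysqlbinlog layout in which each event starts with a "# at {position}" line and a transaction's commit event carries "Xid =". Transactions that end some other way will not be flagged, so check the decoded file by hand before changing maxwell.positions.

```python
import re

AT_LINE = re.compile(r"^# at (\d+)\s*$")
END_LOG_POS = re.compile(r"\bend_log_pos (\d+)\b")


def candidate_positions(decoded_sql, error_position):
    """List (end_log_pos, closes_transaction) for events from '# at <error_position>' on.

    Hypothetical helper: assumes the usual mysqlbinlog text layout.
    """
    candidates = []
    found = False
    for line in decoded_sql.splitlines():
        at = AT_LINE.match(line)
        if at and int(at.group(1)) == error_position:
            found = True
            continue
        if not found:
            continue
        end = END_LOG_POS.search(line)
        if end:
            # Assumption: an event line carrying "Xid =" is the commit ending a transaction.
            candidates.append((int(end.group(1)), "Xid =" in line))
    return candidates


# Invented excerpt in the style of decoded mysqlbinlog output, for illustration only.
sample_decoded = """\
# at 100
#200616 11:14:39 server id 1  end_log_pos 160 CRC32 0x00000000  Table_map: `db`.`t` mapped to number 1
# at 160
#200616 11:14:39 server id 1  end_log_pos 220 CRC32 0x00000000  Write_rows: table id 1 flags: STMT_END_F
# at 220
#200616 11:14:39 server id 1  end_log_pos 251 CRC32 0x00000000  Xid = 42
COMMIT/*!*/;
"""

print(candidate_positions(sample_decoded, 100))
# prints [(160, False), (220, False), (251, True)]
```

In this made-up excerpt the first candidate (160) is still inside the transaction, which matches the experience above that the first end_log_pos was also a problem. The entry flagged True (251) is the end of the transaction.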
Misc thoughts:
It’s not appropriate to skip transactions in production without taking lots of mitigating actions (i.e. what to do about the transactions that haven’t been forwarded).
The recurring issue on STAGE may be caused by the weekly prod->stage synchronisation, but the CDC logs (STAGE) showed the following error.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213