2021-12-09 DMS tasks not running
Date | Dec 9, 2021 |
Authors |
|
Status | Documenting |
Summary | DMS tasks were down for roughly two months (since mid-October 2021) and were only brought back up on Dec 9, 2021 |
Impact | No data changes were synchronized to TSS during those two months |
Non-technical Description
DMS was not running on either prod or preprod. Changes in the TIS databases that should have been captured and synchronized to TSS were not captured while DMS was down.
Trigger
The MySQL server was not updated ahead of the change in DMS’s address, so the new address was not whitelisted
Detection
@Andy Dingley noticed the tasks were not running on either preprod or prod
Resolution
Whitelisting DMS’s new address on the MySQL server
Timeline
Oct 15, 2021 (approximate) - DMS tasks stop working
Dec 9, 2021 14:10 GMT - Andy finds that the DMS tasks are not running
Dec 9, 2021 15:30 GMT - Tasks restarted successfully after the whitelisting of DMS addresses
Dec 23, 2021 14:20 GMT - Ticket opened with AWS Support to get information on why the DMS addresses were changed
Dec 23, 2021 18:50 GMT - Response from AWS Support
Root Cause(s)
DMS’s address change
Why did the DMS address change (& when might it happen again?)
Does this mean it’s a firewall thing? MySQL user too? Is there a more dynamic way that we can set this?
Use AWS Secrets Manager instead?
AWS Support mentioned that a host replacement occurred on both the preprod and prod Replication Instances on Dec 13, 2021. AWS Support was unable to access the records relating to our DMS service issues in October, because the process logs are only kept for a limited time. However, there is a good chance that a host replacement also occurred in October. The public IPs would have changed when the hosts were replaced.
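One possible way to catch the next address change early is to compare the public IPs that DMS reports for each Replication Instance against the addresses currently whitelisted on the MySQL side. The sketch below is only an illustration: the function name `unlisted_ips` and the example identifiers and IPs are hypothetical, and it assumes the whitelist can be expressed as a plain set of IP strings and that the instance records have the shape returned by boto3's `describe_replication_instances` call.

```python
from typing import Dict, Iterable, List, Set


def unlisted_ips(instances: Iterable[dict], allowlist: Set[str]) -> Dict[str, List[str]]:
    """Map each replication instance to its public IPs that are not whitelisted.

    `instances` is assumed to look like the "ReplicationInstances" list in a
    boto3 `describe_replication_instances` response.
    """
    missing: Dict[str, List[str]] = {}
    for inst in instances:
        ips = inst.get("ReplicationInstancePublicIpAddresses") or []
        absent = sorted(ip for ip in ips if ip not in allowlist)
        if absent:
            name = inst.get("ReplicationInstanceIdentifier", "<unknown>")
            missing[name] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical data using documentation-only IP ranges.
    example_instances = [
        {
            "ReplicationInstanceIdentifier": "example-prod",
            "ReplicationInstancePublicIpAddresses": ["203.0.113.10"],
        },
        {
            "ReplicationInstanceIdentifier": "example-preprod",
            "ReplicationInstancePublicIpAddresses": ["198.51.100.7"],
        },
    ]
    example_allowlist = {"203.0.113.10"}
    print(unlisted_ips(example_instances, example_allowlist))
```

A non-empty result would mean DMS is connecting from an address that the MySQL server does not yet accept, which is the condition that caused this incident.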
Action Items
Action Items | Owner |
---|---|
Add monitoring to DMS | |
Mitigate to prevent this from happening in the future | |
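As a starting point for the monitoring action item, a scheduled check could list the DMS replication tasks and flag any that are not running. This is a minimal sketch rather than an agreed design: the function names, the choice of which statuses count as healthy, and the example task identifiers are all assumptions. The AWS call uses boto3's DMS client and its `describe_replication_tasks` paginator, and needs credentials to run.

```python
from typing import Iterable, List

# Assumption: only "running" counts as healthy for our always-on tasks.
HEALTHY_STATUSES = {"running"}


def tasks_not_running(tasks: Iterable[dict]) -> List[str]:
    """Return the identifiers of tasks whose Status is not considered healthy."""
    return [
        task.get("ReplicationTaskIdentifier", "<unknown>")
        for task in tasks
        if task.get("Status") not in HEALTHY_STATUSES
    ]


def fetch_tasks(region_name: str) -> List[dict]:
    """Fetch all DMS replication tasks in a region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the check above can be tested without AWS access

    client = boto3.client("dms", region_name=region_name)
    tasks: List[dict] = []
    for page in client.get_paginator("describe_replication_tasks").paginate():
        tasks.extend(page["ReplicationTasks"])
    return tasks


if __name__ == "__main__":
    # Hypothetical sample data; in a real check this would be fetch_tasks(<region>).
    sample = [
        {"ReplicationTaskIdentifier": "example-prod-task", "Status": "running"},
        {"ReplicationTaskIdentifier": "example-preprod-task", "Status": "stopped"},
    ]
    print(tasks_not_running(sample))
```

The output of `tasks_not_running` could then be posted to Slack or turned into an alert, so that stopped tasks are noticed within hours rather than months.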
Lessons Learned
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213