2022-04-22 ESR integration database went down
Date | Apr 22, 2022 |
Authors | @Joseph (Pepe) Kelly |
Status | Documenting |
Summary | The store of information that helps ESR & TIS integrate became unavailable |
Impact | The exchange of information between TIS & ESR was disrupted. We will make some checks for missing or erroneous data as a result of the outage. |
Non-technical Description
There is a virtual computer (server) that runs the database. This stopped responding to requests and was flagged by our monitoring systems as having disappeared (similar to having crashed). We restarted the server and confirmed from our indicators that normal service had resumed.
We made some checks for data that may have been ‘dropped’ or not handled correctly in the ~15 minutes that the database was unavailable. We resubmitted messages for 2 “New post” notifications which were then sent to ESR on 23rd April.
Trigger
Detection
Monitoring alert in Slack.
Attempts to manually connect to the server
Resolution
Restarted the server
TODO: Check audit information/DLQ for failed messages.
Timeline
BST unless otherwise stated
Apr 22, 2022 12:28:40 - Disk IOPS spiked to 328 ops/sec
Apr 22, 2022 12:29:48.244 - Connection between cluster nodes failed
Apr 22, 2022 12:29:55.763 - Container health check failed
Apr 22, 2022 12:31 BST - Alert message posted in Slack
Apr 22, 2022 12:38.822 to 12:40:40.159 - Service recovering to a functional state
Apr 22, 2022 15:22 - Resubmitted Dead Letter messages for new posts: KSS/FRIM/ITP/800/30162 and KSS/FRIM/ITP/800/30163
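As a rough cross-check of the outage length, the interval between two of the timeline entries above can be computed directly. This is a minimal sketch: it assumes the outage is measured from the cluster connection failure (12:29:48.244) to the end of the recovery window (12:40:40.159), and the variable names are illustrative only.

```python
from datetime import datetime

# Timestamp layout used for the two timeline entries (BST, millisecond precision).
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Assumption: the outage runs from the cluster connection failure
# to the end of the "Service recovering" window in the timeline.
node_link_failed = datetime.strptime("2022-04-22 12:29:48.244", FMT)
service_recovered = datetime.strptime("2022-04-22 12:40:40.159", FMT)

outage = service_recovered - node_link_failed
minutes, seconds = divmod(outage.total_seconds(), 60)
print(f"Outage: {int(minutes)} min {seconds:.3f} s")
```

By these two entries the database was unavailable for roughly 11 minutes. The ~15 minutes quoted in the non-technical description may be a rounded figure or may cover time either side of these two events.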
Root Cause(s)
Nothing new was discovered beyond what was found in previous outages.
Memory usage and disk reads spiked.
Action Items
Action Items | Owner |
---|---|
| @Joseph (Pepe) Kelly |
Lessons Learned
Despite a planned move to a DBaaS platform, there are still improvements that might be valuable in the weeks or months before that is in place.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213