2022-10-04 Mongo database failure
Date | Oct 4, 2022 |
Authors | @Joseph (Pepe) Kelly @Reuben Roberts |
Status | Done |
Summary |
|
Impact | 1-day delay in the “Notifications” going to trusts. Slight chance that some TIS records have not been sent to ESR |
Non-technical Description
Mongo is the database that the TIS ESR services use to store trainee data coming into TIS from ESR, and to keep a record of the notifications of trainee data changes that TIS in turn sends to ESR. When that database fails, the ESR services cannot function. Trainee data is not lost, but the communication between TIS and ESR is disrupted. ESR did not send us files between the 1st and 3rd of October, so they then sent us several days' worth of files over 2 days. This did not cause an immediate problem, but the load hit an unmanageable level while generating notifications for ESR.
There were no notification files generated on the 4th. The “pending notifications” which should have been exported that day were aggregated and exported the following day, 5th October.
Trigger
Accumulated server load
Detection
Messages in the Slack monitoring-prod channel.
Resolution
Restarted the server and checked whether the files that should have been sent to ESR had been exported. Notification files had not been exported, but the files generated on Wednesday were significantly larger than those from earlier in the week and from the same week last year.
Timeline
UTC unless otherwise stated
Oct 4, 2022 13:02 - First alert about high levels of memory usage.
Oct 4, 2022 13:07 - Alert about database unavailability.
Oct 4, 2022 13:12-13:20 - Failed attempt to gracefully restart services.
Oct 4, 2022 13:20-13:37 - Hard restart of server.
Oct 6, 2022 - Confirmation of files which were not generated.
Root Cause(s)
Server became unresponsive.
Action Items
Action Items | Owner |
---|---|
Adjust threshold for alerting to prefer more unnecessary restarts. | @Joseph (Pepe) Kelly [Done] |
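As an illustration of the threshold adjustment above, the following is a minimal sketch of what "preferring more unnecessary restarts" could look like. It is not the actual alerting configuration, which this report does not describe: `RestartPolicy`, `should_restart`, the percentage thresholds and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RestartPolicy:
    """Hypothetical policy: restart once memory usage has stayed at or
    above a threshold for a number of consecutive samples."""
    memory_threshold_pct: float
    consecutive_breaches: int


def should_restart(samples: Sequence[float], policy: RestartPolicy) -> bool:
    """Return True if the most recent samples all breach the threshold."""
    n = policy.consecutive_breaches
    if len(samples) < n:
        # Not enough data yet to decide.
        return False
    return all(s >= policy.memory_threshold_pct for s in samples[-n:])


# Assumed values for illustration only.
conservative = RestartPolicy(memory_threshold_pct=95.0, consecutive_breaches=5)
eager = RestartPolicy(memory_threshold_pct=85.0, consecutive_breaches=2)

# Made-up memory usage readings (percent), oldest first.
memory_samples = [70.0, 82.0, 88.0, 91.0, 93.0]

print(should_restart(memory_samples, conservative))
print(should_restart(memory_samples, eager))
```

With these made-up readings the eager policy triggers a restart while the conservative one does not. The trade-off is the one named in the action item: a lower threshold and a shorter window accept some unnecessary restarts in exchange for acting before the server becomes unresponsive.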
Lessons Learned
We didn’t pick up on the significantly busier time of year, or discuss any actions, after Wed 4th May.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213