2022-10-04 Mongo database failure

Date

Oct 4, 2022

Authors

@Joseph (Pepe) Kelly @Reuben Roberts

Status

Done

Summary

 

Impact

1-day delay in the “Notifications” going to trusts. Slight chance that some TIS records have not been ESR

Non-technical Description

Mongo is the database used by the TIS ESR services to store trainee data coming into TIS from ESR, and to keep a record of the notifications of trainee data changes that TIS in turn sends to ESR. When that database fails, the ESR services cannot function. Trainee data is not lost, but the communication between TIS and ESR is disrupted. ESR did not send us files between the 1st and 3rd of October, so they then sent us several days' worth of files over 2 days. This did not cause an immediate problem. However, the load hit an unmanageable level while generating notifications for ESR.

No notification files were generated on the 4th. The “pending notifications” which should have been exported that day were aggregated and exported on the following day, 5th October.

Trigger

  • Accumulated server load

Detection

  • Messages in Slack monitoring-prod channel.


Resolution

  • Restarted the server and checked whether the files that should have been sent to ESR had been exported. Notification files had not been, but the files generated on Wednesday were significantly larger than those from earlier in the week and from the same week last year.
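The check for files that were not generated can be sketched as a comparison between the days an export was expected and the days one actually exists. This is a minimal illustration only: the function name `missing_export_dates` is hypothetical, and the dates below are taken from this report (nothing generated on the 4th, an export on the 5th) rather than from the real export listing.

```python
from datetime import date, timedelta


def missing_export_dates(exported, start, end):
    """Return every date in [start, end] for which no export exists."""
    exported_days = set(exported)
    total_days = (end - start).days + 1
    expected = [start + timedelta(days=i) for i in range(total_days)]
    return [day for day in expected if day not in exported_days]


# Illustrative input: an export exists for 5 Oct only (assumption based on the report).
exported = [date(2022, 10, 5)]
gaps = missing_export_dates(exported, date(2022, 10, 4), date(2022, 10, 5))
print(gaps)  # [datetime.date(2022, 10, 4)]
```

In practice the list of exported dates would come from wherever the notification files are stored, which this report does not describe.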


Timeline

UTC unless otherwise stated

  • Oct 4, 2022 13:02 - First alert about high levels of memory usage.

  • Oct 4, 2022 13:07 - Alert about database unavailability.

  • Oct 4, 2022 13:12-13:20 - Failed attempt to gracefully restart services.

  • Oct 4, 2022 13:20-13:37 - Hard restart of server.

  • Oct 6, 2022 - Confirmation of files which were not generated.

 


Root Cause(s)

  • Server became unresponsive.


Action Items

Action Items

Owner

Adjust the alerting threshold to err on the side of more (possibly unnecessary) restarts.

@Joseph (Pepe) Kelly [Done]
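The trade-off behind this action can be sketched as follows: a lower memory threshold fires earlier in the same sequence of readings, at the cost of sometimes triggering a restart that would not have been needed. All numbers here are hypothetical; the real thresholds and the monitoring tool's configuration are not given in this report, and `first_alert_index` is an illustrative helper, not part of any monitoring system.

```python
def first_alert_index(samples, threshold):
    """Return the index of the first reading at or above threshold, or None."""
    for index, used_fraction in enumerate(samples):
        if used_fraction >= threshold:
            return index
    return None


# Hypothetical memory-usage readings (fraction of memory in use), oldest first.
samples = [0.62, 0.71, 0.83, 0.91, 0.97]

OLD_THRESHOLD = 0.95  # assumed value, for illustration only
NEW_THRESHOLD = 0.80  # assumed value, for illustration only

print(first_alert_index(samples, OLD_THRESHOLD))  # 4
print(first_alert_index(samples, NEW_THRESHOLD))  # 2
```

With these made-up readings the lower threshold alerts two samples earlier, which leaves more time for a graceful restart before the server becomes unresponsive.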

 

 


Lessons Learned

  • We didn’t pick up on the significantly busier time of year, or discuss any actions, after Wed 4th May.