Date

Authors

Joseph (Pepe) Kelly, Reuben Roberts

Status

Documenting

Summary

Impact

A one-day delay in the “Notifications” going to trusts. Slight chance that some TIS records have not been sent to ESR.

Non-technical Description

Mongo is the database used by the TIS ESR services to store trainee data coming into TIS from ESR, and to keep a record of the notifications of trainee data changes that TIS in turn sends to ESR. When that database fails, the ESR services cannot function. Trainee data is not lost, but the communication between TIS and ESR is disrupted. ESR did not send us files between the 1st and 3rd, so they sent us several days' worth of files over 2 days. This did not cause an immediate problem, but the load reached an unmanageable level while generating notifications for ESR.

No notification files were generated on the 4th. The “pending notifications” which should have been exported that day were aggregated and exported the following day, 5th October.

Trigger

  • Accumulated server load

Detection

  • Messages in Slack monitoring-prod channel.


Resolution

  • Restarted the server and checked whether the files that should have been sent to ESR had been exported. Notification files had not been, but the files generated on Wednesday were significantly larger than those from earlier in the week and from the same week last year.


Timeline

UTC unless otherwise stated

  • 13:02 - First alert about high levels of memory usage.

  • 13:07 - Alert about database unavailability.

  • 13:12-13:20 - Failed attempt to gracefully restart services.

  • 13:20-13:37 - Hard Restart of server.

  • - Confirmation of files which were not generated.


Root Cause(s)

  • Server became unresponsive.


Action Items

Action Items

Owner

Adjust the alerting threshold so that it triggers earlier, accepting more unnecessary restarts.

Joseph (Pepe) Kelly


Lessons Learned

  • We didn’t pick up on the significantly busier time of year or discuss any actions after Wed 4th May.
