Date

Authors

Rob Pink, Steven Howard, Joseph (Pepe) Kelly, Tobi Olufuwa

Status

Done

Summary

Some users experienced issues with TIS, describing it as “crashing and slow loading over a couple of days”.

Impact

Affected users were unable to use TIS.

Non-technical Description

Users on TIS reported that the application was very slow and occasionally crashed when they attempted to retrieve or update data. This was the result of proxy errors being returned from the TCS endpoints.

image-20241010-095351.png


Trigger

A spike in incoming requests caused the servers hosting key services to exceed their CPU capacity, leading to an inability to process these requests in a timely manner.


Detection

Users notified the team on the Teams channel.


Resolution


Timeline

All times BST unless otherwise indicated.

fnfb-502.PNG

5 Whys (or other analysis of Root Cause)

Q. Why was TIS responding slowly or crashing?
A. TCS endpoints were not returning data and were timing out with proxy errors.

Q. Why was TCS erroring and not returning data?
A. TCS was unable to process the requests in a timely manner due to high CPU and memory utilization.

Q. Why were there not enough tasks to handle the increased load?
A. Auto-scaling was not enabled, so the system did not automatically scale up the number of tasks in response to the surge in traffic.

Q. Why were there errors on sentry-esr?
A. TCS was unresponsive.
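
To illustrate the auto-scaling gap identified above, the following is a minimal sketch of a proportional scaling rule of the kind used by target-tracking auto-scalers. It is not the configuration used for TCS: the function name `desired_task_count`, the 60% CPU target, and the task limits are all assumptions chosen for illustration.

```python
import math


def desired_task_count(current_tasks, cpu_percent,
                       target_percent=60.0, min_tasks=2, max_tasks=10):
    """Return how many tasks would bring average CPU back near the target.

    Hypothetical helper: the target and the min/max limits are illustrative
    values, not figures taken from this incident.
    """
    if current_tasks < 1:
        raise ValueError("current_tasks must be at least 1")
    # Scale the task count in proportion to how far CPU is from the target.
    raw = math.ceil(current_tasks * cpu_percent / target_percent)
    # Clamp to the configured limits so scaling stays within capacity and budget.
    return max(min_tasks, min(max_tasks, raw))


if __name__ == "__main__":
    # Two tasks running at 95% CPU against a 60% target.
    print(desired_task_count(2, 95.0))
```

Under these assumed values, two tasks at 95% CPU would be scaled out to four. Without such a rule in place, the task count stays fixed regardless of load, which matches the behaviour described in the answers above.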


Action Items

Action Items

Owner

Implement Auto-Scaling & Enhance Monitoring and Alerts

Tobi Olufuwa
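
As a rough sketch of what the monitoring and alerting action could check, the function below flags sustained high CPU and ignores brief spikes. The name `sustained_breach`, the 80% threshold, and the three-sample window are assumptions for illustration, not settings from the TIS monitoring stack.

```python
def sustained_breach(samples, threshold=80.0, consecutive=3):
    """Return True if CPU stayed above the threshold for N samples in a row.

    Hypothetical helper: threshold and window size are illustrative values.
    """
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # CPU readings in percent, one per monitoring interval.
    print(sustained_breach([50.0, 85.0, 90.0, 95.0]))
```

Requiring several consecutive breaches is one common way to reduce noisy alerts. The right window depends on how quickly the team needs to be told, which in this incident was only after users reported the problem.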

See also:


Lessons Learned