2024-10-08 Users experiencing slow loading and crashing

Date

Oct 8, 2024

Authors

@Rob Pink @Steven Howard @Joseph (Pepe) Kelly @Tobi Olufuwa

Status

Done

Summary

Some users experienced issues with TIS, describing it as “crashing and slow loading over a couple of days”.

Impact

Users unable to use TIS
https://hee-tis.atlassian.net/browse/TIS21-6610

Non-technical Description

Users on TIS reported that the application was very slow and occasionally crashing when attempting to retrieve or update data. This was the result of proxy errors being returned from the TCS endpoints.

(Screenshot of the proxy errors returned by TCS: image-20241010-095351.png)

 


Trigger

A spike in incoming requests pushed the servers hosting key services beyond their CPU capacity, leaving them unable to process requests in a timely manner.


Detection

Notified by users on the Teams channel


Resolution

  • Replaced the running TCS tasks

  • Increased the TCS task memory from 2 GB to 3 GB (a sketch of this change follows below)

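For future reference, the sketch below shows how a task-memory change like this could be applied, assuming TCS runs as an AWS ECS service (the incident notes refer to "tasks"); the cluster and service names are illustrative placeholders, and the change made on the day may equally have been applied through the console or infrastructure code.

# Hedged sketch: raising TCS task memory from 2 GB to 3 GB, assuming an ECS
# service. Cluster/service/family names below are illustrative placeholders.
import boto3

ecs = boto3.client("ecs")

# Pull the current task definition for the TCS family (name assumed).
current = ecs.describe_task_definition(taskDefinition="tcs")["taskDefinition"]

# Re-register it with task-level memory raised to 3072 MiB (was 2048 MiB),
# copying the other settings across unchanged.
keep = ["family", "containerDefinitions", "cpu", "networkMode",
        "requiresCompatibilities", "executionRoleArn", "taskRoleArn", "volumes"]
params = {k: current[k] for k in keep if k in current}
params["memory"] = "3072"
new_def = ecs.register_task_definition(**params)["taskDefinition"]

# Roll the service onto the new revision, which replaces the running tasks.
ecs.update_service(
    cluster="tis-cluster",  # assumed cluster name
    service="tcs",          # assumed service name
    taskDefinition=new_def["taskDefinitionArn"],
    forceNewDeployment=True,
)
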

Timeline

All times BST unless otherwise indicated.

  • Oct 8, 2024 11:28 Numerous requests to TCS started returning a “Bad Gateway” error. Logs showed that all 3 tasks were receiving and processing some activity.

  • Oct 8, 2024 12:29 User reported the problem on Teams: “TIS being slow / crashing when doing updates - A few of us in Wessex are experiencing issues with TIS when trying to do updates - it keeps crashing and is being really slow loading over the past couple of days.”

  • Oct 8, 2024 13:00 Message acknowledged

  • Oct 8, 2024 13:28 Last minute in which multiple requests to TCS failed. Some further errors occurred later, around 15:00 and again between 17:00 and 18:00.

  • Oct 8, 2024 13:42 Errors spotted on the esr-sentry channel, so ESR services were stopped to prevent further errors.

  • Oct 8, 2024 13:53 Service restarted and users informed that the issue looked resolved.

  • Oct 8, 2024 14:10 Users confirmed things were back to normal.

  • Oct 11, 2024 16:00 Logs relating to ESR checked and ESR services restarted

  • Oct 12, 2024 11:00 Applicant records from 8th Oct sent out

  • Oct 12, 2024 12:00 Notification records from 8th Oct sent out, excluding stale records

  • Oct 30, 2024 Reproduced the 502 error by overloading the TCS service on stage with a high volume of requests to drive up CPU utilization (a sketch of this kind of test follows below)

(Screenshot of the reproduced 502 error on stage: fnfb-502.PNG)
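The stage reproduction amounted to sustained concurrent load against TCS. A minimal sketch of that kind of test is below; the endpoint URL, request counts, and concurrency are illustrative, not the exact figures used.

# Hedged sketch of the load that reproduced the 502 on stage: fire a burst of
# concurrent requests at a TCS endpoint and count the response codes/errors.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

STAGE_URL = "https://stage.example.nhs.uk/tcs/api/programme"  # placeholder endpoint
TOTAL_REQUESTS = 2000
CONCURRENCY = 100

def hit(_):
    try:
        return requests.get(STAGE_URL, timeout=30).status_code
    except requests.RequestException as exc:
        return type(exc).__name__

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = Counter(pool.map(hit, range(TOTAL_REQUESTS)))

# A sustained run like this drove TCS CPU high enough that the proxy began
# returning 502 Bad Gateway, matching the behaviour seen on 8 Oct.
print(results)
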

5 Whys (or other analysis of Root Cause)

Q. Why was TIS responding slowly or crashing?
A. TCS endpoints were not returning data and timing out with proxy errors

Q. Why was TCS erroring and not returning data?
A. TCS was unable to process the requests in a timely manner due to high CPU and memory utilization.

Q. Why were there not enough tasks to handle the increased load?
A. Auto-scaling was not enabled, so the system did not automatically scale up the number of tasks in response to the surge in traffic.

Q. Why were there errors on sentry-esr?
A. TCS was unresponsive


Action Items

  • Implement Auto-Scaling & Enhance Monitoring and Alerts
    Owner: @Tobi Olufuwa
    Ticket: https://hee-tis.atlassian.net/browse/TIS21-6690
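As a starting point for this action item, the sketch below shows one way auto-scaling could be enabled, assuming TCS is an ECS service managed through AWS Application Auto Scaling; the cluster and service names, task counts, and CPU target are illustrative and would need to be agreed as part of TIS21-6690.

# Hedged sketch: enable target-tracking auto-scaling for the TCS ECS service.
# Resource names and capacity/CPU figures are illustrative placeholders.
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "service/tis-cluster/tcs"  # assumed cluster/service names

# Allow the service to scale between 3 and 6 tasks instead of a fixed count.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=3,
    MaxCapacity=6,
)

# Target-tracking policy: add tasks when average CPU stays above 60%.
aas.put_scaling_policy(
    PolicyName="tcs-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
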


See also:


Lessons Learned

  • Importance of Auto-Scaling: The absence of auto-scaling left the system vulnerable to unexpected traffic surges, highlighting the critical need for automated scaling mechanisms to handle dynamic workloads efficiently.

  • Proactive Resource Management: Relying solely on static resource allocation can lead to bottlenecks during periods of high demand. Regular reviews of scaling configurations and resource management policies are essential to ensure the infrastructure can adapt to changing load patterns.

  • Monitoring and Early Detection: Enhanced monitoring and alerting are crucial for identifying system strain early. By detecting potential issues before they affect service performance, proactive interventions can be made to avoid disruptions.
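
As an illustration of the monitoring and early-detection point above, the sketch below defines an alarm on TCS service CPU that would fire before requests start failing, assuming the services run on ECS with standard CloudWatch metrics; the cluster, service, threshold, and SNS topic are placeholders.

# Hedged sketch: CloudWatch alarm on TCS service CPU for early warning.
# Cluster, service, and SNS topic names are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="tcs-cpu-high",
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "tis-cluster"},
        {"Name": "ServiceName", "Value": "tcs"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-2:123456789012:tis-alerts"],  # placeholder topic
    AlarmDescription="TCS average CPU above 80% for 5 minutes; investigate before 502s appear.",
)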