Date

Authors

Rob Pink, Steven Howard, Joseph (Pepe) Kelly

Status

Summary

Some users experienced issues with TIS, describing it as “crashing and slow loading over a couple of days”.

Impact

Users unable to use TIS

Jira: TIS21-6610

...

Users on TIS reported that the application was very slow and occasionally crashed when they attempted to retrieve or update data. This was the result of proxy errors being returned from the TCS endpoints.

[Image: image-20241010-095351.png]

...

Trigger

A spike in incoming requests caused the servers hosting key services to exceed their CPU capacity, leading to an inability to process these requests in a timely manner.

...

Detection

Users notified the team on the Teams channel.

...

Resolution

  • Replaced the TCS tasks

...

Timeline

All times BST unless otherwise indicated.

  • 11:28 Numerous requests to TCS start responding with a “Bad Gateway” error. Logs show that all 3 tasks were receiving and processing some activity.

  • 12:29 A user reported the problem on Teams: “TIS being slow / crashing when doing updates - A few of us in Wessex are experiencing issues with TIS when trying to do updates - it keeps crashing and is being really slow loading over the past couple of days.”

  • 13:00 Message acknowledged

  • 13:28 Last minute in which multiple requests to TCS failed. Some errors occurred later, close to 15:00 and between 17:00 and 18:00

  • 13:42 Errors spotted on the esr-sentry channel, so ESR services were stopped to prevent potential errors

  • 13:53 Service restarted; users were told that the issue looked resolved

  • 14:10 Users confirmed things were back to normal

  • 16:00 Logs relating to ESR checked and ESR services restarted

  • 11:00 Applicant records from 8th Oct sent out

  • 12:00 Notification records from 8th Oct sent out excluding stale
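
The intervals between the same-day events in the timeline can be worked out with a few lines of Python. This is only an illustrative sketch: the variable names are made up here, and it assumes the 11:28 to 14:10 entries all fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Times copied from the timeline (BST); the names are illustrative only.
first_error = "11:28"
user_report = "12:29"
acknowledged = "13:00"
restarted = "13:53"
confirmed = "14:10"

time_to_report = minutes_between(first_error, user_report)       # 61
time_to_acknowledge = minutes_between(user_report, acknowledged)  # 31
time_to_restart = minutes_between(first_error, restarted)         # 145
time_to_confirm = minutes_between(first_error, confirmed)         # 162

print(time_to_report, time_to_acknowledge, time_to_restart, time_to_confirm)
```

On these figures, roughly an hour passed between the first “Bad Gateway” responses and the first user report, which is the gap that automated alerting would be expected to close.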

...

Q. Why was TCS erroring and not returning data?
A. TCS was unable to process the requests in a timely manner due to high CPU utilization.

Q. Why were there not enough tasks to handle the increased load?
A. Auto-scaling was not enabled, so the system did not automatically scale up the number of tasks in response to the surge in traffic.

Q. Why were there errors on sentry-esr?
A. TCS was unresponsive.
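
To illustrate the auto-scaling answer above, here is a minimal sketch of a CPU-based target-tracking rule, written in plain Python rather than against any real platform API. Only the figure of 3 running tasks comes from this report. The function name, the 60% CPU target and the 3 to 6 task bounds are hypothetical, and a real scaler may calculate the desired count differently.

```python
import math


def desired_task_count(
    current_tasks: int,
    avg_cpu_percent: float,
    target_cpu_percent: float = 60.0,  # hypothetical target, not from the report
    min_tasks: int = 3,                # the report mentions 3 running tasks
    max_tasks: int = 6,                # hypothetical upper bound
) -> int:
    """Scale the task count in proportion to how far CPU is from the target."""
    if current_tasks < 1:
        raise ValueError("current_tasks must be at least 1")
    if target_cpu_percent <= 0:
        raise ValueError("target_cpu_percent must be positive")
    # Proportional rule: enough tasks to bring average CPU back to the target.
    raw = math.ceil(current_tasks * avg_cpu_percent / target_cpu_percent)
    # Clamp to the configured bounds so the service never scales to zero
    # or grows without limit.
    return max(min_tasks, min(max_tasks, raw))


# With 3 tasks running at an assumed 95% average CPU, the rule asks for 5.
print(desired_task_count(current_tasks=3, avg_cpu_percent=95.0))
```

With a rule of this kind in place, a surge that pushes average CPU well above the target would add tasks automatically instead of relying on a manual replacement.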

...