Date

08 Oct 2024

Authors

Rob Pink, Steven Howard, Joseph (Pepe) Kelly, Tobi Olufuwa

Status

Done

Summary

Some users experienced issues with TIS, describing it as “crashing and slow loading over a couple of days”.

Impact

Users were unable to use TIS.

Jira: TIS21-6610

...

Resolution

Replaced the TCS tasks
Increased the TCS task memory from 2 GB to 3 GB

...

Timeline

All times BST unless otherwise indicated.

...

Q. Why was TCS erroring and not returning data?
A. TCS was unable to process requests in a timely manner due to high CPU and memory utilization.

Q. Why were there not enough tasks to handle the increased load?
A. Auto-scaling was not enabled, so the system did not automatically scale up the number of tasks in response to the surge in traffic.

...

Action Items

Action: Implement Auto-Scaling & Enhance Monitoring and Alerts
Owner: Tobi Olufuwa
Jira: TIS21-6690

Lessons Learned

Importance of Auto-Scaling: The absence of auto-scaling left the system vulnerable to unexpected traffic surges, highlighting the critical need for automated scaling mechanisms to handle dynamic workloads efficiently.
Proactive Resource Management: Relying solely on static resource allocation can lead to bottlenecks during periods of high demand. Regular reviews of scaling configurations and resource management policies are essential to ensure the infrastructure can adapt to changing load patterns.
Monitoring and Early Detection: Enhanced monitoring and alerting are crucial for identifying system strain early. By detecting potential issues before they affect service performance, proactive interventions can be made to avoid disruptions.
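
The scaling behaviour called for above can be illustrated with a minimal sketch of a proportional (target-tracking style) rule: given the current number of tasks and their average CPU utilization, it estimates how many tasks would bring utilization back towards a target. This is an illustration only, not the configuration applied to TCS. The function name, the 60% target and the task limits of 2 and 6 are all assumptions, since the report does not state the platform's actual policy or thresholds.

```python
import math


def desired_task_count(current_tasks: int,
                       cpu_utilization: float,
                       target_utilization: float = 60.0,  # assumed target, not from the report
                       min_tasks: int = 2,                # assumed lower bound
                       max_tasks: int = 6) -> int:        # assumed upper bound
    """Estimate the task count that brings average CPU close to the target.

    Uses a simple proportional rule: if utilization is 1.5x the target,
    ask for 1.5x the tasks (rounded up), then clamp to the allowed range.
    """
    if current_tasks < 1:
        raise ValueError("current_tasks must be at least 1")
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")

    raw = math.ceil(current_tasks * cpu_utilization / target_utilization)
    # Never scale below the minimum or above the maximum.
    return max(min_tasks, min(max_tasks, raw))


if __name__ == "__main__":
    # Two tasks running at 90% average CPU against a 60% target.
    print(desired_task_count(current_tasks=2, cpu_utilization=90.0))
```

With a static task count, the equivalent of this calculation has to be done by hand during an incident. An auto-scaling policy evaluates it continuously, and the upper bound keeps a traffic surge from scaling costs without limit.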

...
