2025-01-XX - Intermittent TCS connection issues


Date

Jan 2025

Authors

@Cai Willis

Status

Documenting

Summary

Intermittent connection errors affecting individual TCS tasks

Impact

Users were occasionally unable to load pages in the Admins UI

Non-technical Description

All of our services run three simultaneous copies of themselves. One of the copies of our “TIS-Core-Service” was experiencing intermittent connection issues with other services, such as the database. This meant that users were occasionally unable to load data when their requests were sent to the failing copy.


Trigger

High CPU and Memory usage causing tasks to fail with connection issues to other services


Detection

Sentry alerts indicating connection issues

Users experiencing slow page loads or pages failing to load


5 Whys (or other analysis of Root Cause)

  1. Why were pages failing to load intermittently on TIS admin? Because one or more TCS tasks were failing.

  2. Why did TCS have failing tasks? Because they were experiencing high CPU and memory usage.

  3. Why was TCS experiencing high CPU and memory usage? Tasks were failing during periods of high traffic, and it's possible that recent changes to the profile service have increased load.

  4. Why were tasks failing during periods of high traffic, and why weren’t the provided resources sufficient? Needs investigation (current suspect: profile API calls).


Resolution

  • Increased JVM memory allowance
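
As a rough way to confirm that a larger memory allowance has taken effect, a task could log its heap usage against the configured maximum. The sketch below is illustrative only: the report does not record which JVM settings were changed (typically `-Xmx` or `-XX:MaxRAMPercentage`), and the class name `HeapHeadroom` and the 85% warning threshold are assumptions, not part of TCS.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapHeadroom {

    // Hypothetical warning threshold; not taken from the incident report.
    public static final double WARN_FRACTION = 0.85;

    /** Fraction of the maximum heap in use, or -1.0 if the maximum is undefined. */
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return (double) usedBytes / (double) maxBytes;
    }

    /** True when heap usage is at or above the warning threshold. */
    public static boolean isNearLimit(long usedBytes, long maxBytes) {
        return usedFraction(usedBytes, maxBytes) >= WARN_FRACTION;
    }

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        long used = heap.getUsed();
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum

        long mib = 1024L * 1024L;
        System.out.printf("heap used=%d MiB, max=%d MiB%n", used / mib, max / mib);

        if (isNearLimit(used, max)) {
            System.out.printf("WARNING: heap usage at %.0f%% of maximum%n",
                    usedFraction(used, max) * 100.0);
        }
    }
}
```

Running this inside the service container before and after the change would show whether the reported maximum heap actually increased, and how close normal traffic brings the task to that limit.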


Timeline

All times GMT unless otherwise indicated.

  • Jan 6, 2025 and earlier - intermittent connectivity issues

  • Jan 7, 2025 08:51 Service tasks replaced

  • Jan 7, 2025 10:19 User reports in MS Teams

  • Jan 7, 2025 10:31 Service tasks replaced again, after which users reported normal use

  • Jan 8, 2025 14:10 JVM fix deployed

  • Jan 9, 2025 15:16 Observed that task now becomes unhealthy and is replaced automatically when resources are exceeded

  • Jan 10, 2025 Normal service resumed


Action Items

  • https://hee-tis.atlassian.net/browse/TIS21-6834 (owner not recorded)

  • Investigate increased use of Profile Service and impact on TCS (owner not recorded)


See also:


Lessons Learned

  •  
