2025-01-XX - Intermittent TCS connection issues
Date | Jan 2025 |
Authors | @Cai Willis |
Status | Documenting |
Summary | Intermittent Connection Errors in TCS for single tasks |
Impact | Occasionally unable to load pages on Admins UI |
Non-technical Description
All of our services run three simultaneous copies of themselves. One of the copies of our “TIS-Core-Service” was experiencing intermittent connection issues to other services, such as the database. This meant that users were occasionally unable to load data when their requests were sent to the failing “copy”.
Trigger
High CPU and Memory usage causing tasks to fail with connection issues to other services
Detection
Sentry alerts indicating connection issues
Users experiencing slow/no loading
5 Whys (or other analysis of Root Cause)
Why were pages failing to load intermittently on TIS admin? Because TCS had one or more failing tasks.
Why did TCS have failing tasks? Because it was experiencing high CPU and memory usage.
Why was TCS experiencing high CPU and memory usage? Tasks were failing during periods of high traffic, and it's possible that recent changes to the profile service have increased load.
Why were tasks failing during periods of high traffic / why weren’t the provided resources sufficient? Needs investigation (current suspect: profile API calls).
Resolution
Increased JVM memory allowance
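The report does not record which setting was changed or to what value. As a minimal sketch only, the following Java snippet shows one way to confirm from inside a running task what maximum heap the JVM has actually picked up (the limit is typically controlled with `-Xmx` or `-XX:MaxRAMPercentage`) and how close current usage is to it. The class name `HeapHeadroom` and the 0.9 threshold are hypothetical and are not taken from TCS.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of TCS: reports heap usage relative to the JVM's max heap.
public class HeapHeadroom {

    /** Fraction of the maximum heap in use, or -1 if no maximum is defined. */
    public static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() returns -1 when the max is undefined
        }
        return (double) usedBytes / maxBytes;
    }

    /** True when a defined heap limit is in use at or above the given threshold. */
    public static boolean isUnderPressure(long usedBytes, long maxBytes, double threshold) {
        double fraction = usedFraction(usedBytes, maxBytes);
        return fraction >= 0 && fraction >= threshold;
    }

    public static void main(String[] args) {
        // Standard JMX bean exposing the current heap figures for this JVM.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long mib = 1024L * 1024L;

        System.out.printf("heap used=%d MiB, max=%d MiB, fraction=%.2f%n",
                heap.getUsed() / mib,
                heap.getMax() / mib,
                usedFraction(heap.getUsed(), heap.getMax()));

        // 0.9 is an assumed threshold chosen for illustration only.
        System.out.println("under pressure: "
                + isUnderPressure(heap.getUsed(), heap.getMax(), 0.9));
    }
}
```

Running this with different `-Xmx` values should change the reported maximum, which gives a quick check that a new memory allowance has reached the task.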
Timeline
All times GMT unless otherwise indicated.
Jan 6, 2025 and earlier - intermittent connectivity issues
Jan 7, 2025 08:51 Service tasks replaced
Jan 7, 2025 10:19 User reports in MS Teams
Jan 7, 2025 10:31 Service tasks replaced again, after which users reported normal use
Jan 8, 2025 14:10 JVM fix deployed
Jan 9, 2025 15:16 Observed that task now becomes unhealthy and is replaced automatically when resources are exceeded
Jan 10, 2025 Normal service resumed
Action Items
Action Items | Owner |
---|---|
Investigate increased use of Profile Service and impact on TCS | |
See also:
Lessons Learned
Related content
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213