Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Date

Authors

Yafang Deng Joseph (Pepe) Kelly Cai Willis

Status

Documentation

Summary

Users experiencing issues with TIS person - placement list, describing it as “Placement Grade/Site intermittent loading“

WILLIAMS, Gwilym (NHS ENGLAND - T1510): Placement Grade intermittent loading

posted in TIS Support Channel / General on 05 November 2024 10:21

Impact

Placement list intermittently not showing site/grade names.

Jira Legacy
serverSystem Jira
serverId4c843cd5-e5a9-329d-ae88-66091fcfe3c7
keyTIS21-6685

...

Non-technical Description

Users sometime could not see grade name or site name on Person-Placement list.

...

Trigger

Task did not have memory available.

...

Detection

User alerted via Teams.

...

Resolution

  • Stop Stopped the tcs task 6585721d608b46f4bdc65c66b9bb27d2 and then aws restart restarted a new task automatically.

...

Timeline

All times BST unless otherwise indicated.

  • 15:39:51 Sentry errored: RESTEASY004655: Unable to invoke request: java.lang.IllegalStateException: Connection pool shut down .

  • 10:21 User reported the problem.

  • 11:10 We found only one tcs task erroring and stopped it, then a new task was up running.

  • 11:25 The user was informed of issue resolving.

  • 15:25 - 17:05 Error retrieving authorization information and OutOfMemoryError

5 Whys (or other analysis of Root Cause)

  1. Why placement list page didn’t show the grade or site name?
    TCS goes to Reference service to decorate sites/grades for placements, and it errored.

  2. Why did it happen intermittently?
    Because only one TCS task errored: Connection pool shut down.

...

  1. Why did we get a Connection pool shut down error?
    There’re 1.7K similar errors (Connection pool shut down) in TCS prod, and it started on Nov 4, 2024 3:39:51 PM UTC.
    When looking forward on the logs, on 2024-11-04 15:36:19.736, we got:

    Code Block
    2024-11-04 15:36:19.634  WARN 1 --- [ XNIO-2 task-16] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 0, SQLState: S1000
    2024-11-04 15:36:19.639 ERROR 1 --- [ XNIO-2 task-16] o.h.engine.jdbc.spi.SqlExceptionHelper   : 0
    2024-11-04 15:36:19.736 ERROR 1 --- [ XNIO-2 task-16] o.s.orm.jpa.EntityManagerFactoryUtils    : Failed to release JPA EntityManager
    org.hibernate.exception.GenericJDBCException: Unable to release JDBC Connection
    ...
    2024-11-04 15:36:19.749 ERROR 1 --- [ XNIO-2 task-16] o.s.t.i.TransactionInterceptor           : Application exception overridden by rollback exception
    java.lang.OutOfMemoryError: Java heap space
    2024-11-04 15:36:19.837 ERROR 1 --- [ XNIO-2 task-16] c.t.h.t.t.s.e.ExceptionTranslator        : Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection
    org.springframework.transaction.TransactionSystemException: Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection

    And then, we got a lot of com.netflix.hystrix.exception.HystrixRuntimeException: GET_USER_PROFILE timed-out and no fallback available., following the errors below:

    Code Block
    2024-11-04 15:39:51.760 ERROR 1 --- [strix-PROFILE-1] com.netflix.hystrix.AbstractCommand      : Unrecoverable Error for HystrixCommand so will throw HystrixRuntimeException and not apply fallback. 
    java.lang.Exception: Throwable caught while executing.
    Caused by: java.lang.OutOfMemoryError: Java heap space
    2024-11-04 15:39:51.810 ERROR 1 --- [ XNIO-2 task-11] c.t.h.t.t.s.e.ExceptionTranslator        : Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    2024-11-04 15:39:51.830  WARN 1 --- [  XNIO-2 task-3] com.zaxxer.hikari.pool.ProxyConnection   : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@31a4a7c1 marked as broken because of SQLSTATE(08S01), ErrorCode(0)
    com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
    Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
    Caused by: java.net.SocketException: Connection reset
  2. It’s possible this was casued by the changes we did on how TCS talks to Reference service on Sep 19.
    We searched for the previous logs, and noticed between Aug 01 to Sep 30, we had an OutOfMemory error on 2024-08-27T12:30:52.496Z. And then we searched for the logs between Feb and Jul, and noticed during this period, the first time we got OutOfMemory error was on 08/05/2024 14:40:40.070 UTC.

This live defect is very likely linked to the another one: 2024-10-08 Users experiencing slow loading and crashing . And we would like to monitor it and see if it still happens after the auto-scaling.

...

Action Items

Action Items

Owner

Implement Auto-Scaling

& Enhance Monitoring and Alerts

Tobi Olufuwa

Jira Legacy
serverSystem Jira
serverId4c843cd5-e5a9-329d-ae88-66091fcfe3c7
keyTIS21-6690

...

Lessons Learned