Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 6 Next »

Date

Authors

Yafang Deng Joseph (Pepe) Kelly Cai Willis

Status

Documentation

Summary

Users experiencing issues with TIS person - placement list, describing it as “Placement Grade/Site intermittent loading“

WILLIAMS, Gwilym (NHS ENGLAND - T1510): Placement Grade intermittent loading

posted in TIS Support Channel / General on 05 November 2024 10:21

Impact

Placement list intermittently not showing site/grade names.

TIS21-6685 - Getting issue details... STATUS

Non-technical Description

Users sometime could not see grade name or site name on Person-Placement list.


Trigger

TCS to MySQL connection was broken.


Detection

User alerted via Teams.


Resolution

  • Stopped the tcs task 6585721d608b46f4bdc65c66b9bb27d2 and then aws restarted a new task automatically.


Timeline

All times BST unless otherwise indicated.

  • 15:39:51 Sentry errored: RESTEASY004655: Unable to invoke request: java.lang.IllegalStateException: Connection pool shut down .

  • 10:21 User reported the problem.

  • 11:10 We found only one tcs task erroring and stopped it, then a new task was up running.

  • 11:25 The user was informed of issue resolving.

  • 15:25 - 17:05 Error retrieving authorization information and OutOfMemoryError

5 Whys (or other analysis of Root Cause)

  1. Why placement list page didn’t show the grade or site name?
    TCS goes to Reference service to decorate sites/grades for placements, and it errored.

  2. Why did it happen intermittently?
    Because only one TCS task errored: Connection pool shut down.

image-20241107-111242.png
  1. Why did we get a Connection pool shut down error?
    From Sentry errrors: https://health-education-england-9v.sentry.io/issues/6042348085/events/?project=5752964&referrer=project-issue-stream&sort=timestamp, there’re 1.7K similar errors in TCS prod, and it started on Nov 4, 2024 3:39:51 PM UTC.
    When looking forward on the logs, on 2024-11-04 15:36:19.736, we got:

    2024-11-04 15:36:19.634  WARN 1 --- [ XNIO-2 task-16] o.h.engine.jdbc.spi.SqlExceptionHelper   : SQL Error: 0, SQLState: S1000
    2024-11-04 15:36:19.639 ERROR 1 --- [ XNIO-2 task-16] o.h.engine.jdbc.spi.SqlExceptionHelper   : 0
    2024-11-04 15:36:19.736 ERROR 1 --- [ XNIO-2 task-16] o.s.orm.jpa.EntityManagerFactoryUtils    : Failed to release JPA EntityManager
    org.hibernate.exception.GenericJDBCException: Unable to release JDBC Connection
    ...
    2024-11-04 15:36:19.749 ERROR 1 --- [ XNIO-2 task-16] o.s.t.i.TransactionInterceptor           : Application exception overridden by rollback exception
    java.lang.OutOfMemoryError: Java heap space
    2024-11-04 15:36:19.837 ERROR 1 --- [ XNIO-2 task-16] c.t.h.t.t.s.e.ExceptionTranslator        : Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection
    org.springframework.transaction.TransactionSystemException: Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection

    And then, we got a lot of com.netflix.hystrix.exception.HystrixRuntimeException: GET_USER_PROFILE timed-out and no fallback available., following the errors below:

    2024-11-04 15:39:51.760 ERROR 1 --- [strix-PROFILE-1] com.netflix.hystrix.AbstractCommand      : Unrecoverable Error for HystrixCommand so will throw HystrixRuntimeException and not apply fallback. 
    java.lang.Exception: Throwable caught while executing.
    Caused by: java.lang.OutOfMemoryError: Java heap space
    2024-11-04 15:39:51.810 ERROR 1 --- [ XNIO-2 task-11] c.t.h.t.t.s.e.ExceptionTranslator        : Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    2024-11-04 15:39:51.830  WARN 1 --- [  XNIO-2 task-3] com.zaxxer.hikari.pool.ProxyConnection   : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@31a4a7c1 marked as broken because of SQLSTATE(08S01), ErrorCode(0)
    com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
    Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure
    Caused by: java.net.SocketException: Connection reset
  2. It’s possible this was casued by the changes we did on how TCS talks to Reference service: https://github.com/Health-Education-England/TIS-TCS/pull/1022
    We searched for the previous logs, and noticed between Aug 01 to Sep 30, we had an OutOfMemory error on 2024-08-27T12:30:52.496Z.

This live defect is very likely linked to the another one: 2024-10-08 Users experiencing slow loading and crashing . And we would like to monitor it and see if it still happens after the auto-scaling.


Action Items

Action Items

Owner

Implement Auto-Scaling

& Enhance Monitoring and Alerts

Tobi Olufuwa

TIS21-6690 - Getting issue details... STATUS


Lessons Learned

  • No labels