...
Users sometime could not see grade name or site name on Person-Placement list.
...
Trigger
TCS to MySQL connection was brokenTask did not have memory available.
...
Detection
User alerted via Teams.
...
Why did we get a Connection pool shut down error?
From Sentry errrors: https://health-education-england-9v.sentry.io/issues/6042348085/events/?project=5752964&referrer=project-issue-stream&sort=timestamp, there’re There’re 1.7K similar errors (Connection pool shut down
) in TCS prod, and it started on Nov 4, 2024 3:39:51 PM UTC.
When looking forward on the logs, on2024-11-04 15:36:19.736
, we got:Code Block 2024-11-04 15:36:19.634 WARN 1 --- [ XNIO-2 task-16] o.h.engine.jdbc.spi.SqlExceptionHelper : SQL Error: 0, SQLState: S1000 2024-11-04 15:36:19.639 ERROR 1 --- [ XNIO-2 task-16] o.h.engine.jdbc.spi.SqlExceptionHelper : 0 2024-11-04 15:36:19.736 ERROR 1 --- [ XNIO-2 task-16] o.s.orm.jpa.EntityManagerFactoryUtils : Failed to release JPA EntityManager org.hibernate.exception.GenericJDBCException: Unable to release JDBC Connection ... 2024-11-04 15:36:19.749 ERROR 1 --- [ XNIO-2 task-16] o.s.t.i.TransactionInterceptor : Application exception overridden by rollback exception java.lang.OutOfMemoryError: Java heap space 2024-11-04 15:36:19.837 ERROR 1 --- [ XNIO-2 task-16] c.t.h.t.t.s.e.ExceptionTranslator : Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection org.springframework.transaction.TransactionSystemException: Could not roll back JPA transaction; nested exception is org.hibernate.TransactionException: Unable to rollback against JDBC Connection
And then, we got a lot of
com.netflix.hystrix.exception.HystrixRuntimeException: GET_USER_PROFILE timed-out and no fallback available.
, following the errors below:Code Block 2024-11-04 15:39:51.760 ERROR 1 --- [strix-PROFILE-1] com.netflix.hystrix.AbstractCommand : Unrecoverable Error for HystrixCommand so will throw HystrixRuntimeException and not apply fallback. java.lang.Exception: Throwable caught while executing. Caused by: java.lang.OutOfMemoryError: Java heap space 2024-11-04 15:39:51.810 ERROR 1 --- [ XNIO-2 task-11] c.t.h.t.t.s.e.ExceptionTranslator : Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space 2024-11-04 15:39:51.830 WARN 1 --- [ XNIO-2 task-3] com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@31a4a7c1 marked as broken because of SQLSTATE(08S01), ErrorCode(0) com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure Caused by: java.net.SocketException: Connection reset
It’s possible this was casued by the changes we did on how TCS talks to Reference service :https://github.com/Health-Education-England/TIS-TCS/pull/1022on Sep 19.
We searched for the previous logs, and noticed between Aug 01 to Sep 30, we had an OutOfMemory error on2024-08-27T12:30:52.496Z
. And then we searched for the logs between Feb and Jul, and noticed during this period, the first time we got OutOfMemory error was on 08/05/2024 14:40:40.070 UTC.
This live defect is very likely linked to the another one: 2024-10-08 Users experiencing slow loading and crashing . And we would like to monitor it and see if it still happens after the auto-scaling.
...