...
The
PersonPlacementEmployingBodyTrustJob
started as scheduled (on Prod green), but failed to complete2021-09-08 00:09:00.009 INFO 1 --- [onPool-worker-1] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
Last The last successful chunks for this job were written at 02:40 BST and 02:41:33 BST. These took approx. 25s and 31s respectively.
SQS queueing errors began at 02:52 BST:
2021-09-08 01:52:24.253 ERROR 1 --- [onPool-worker-2] uk.nhs.tis.sync.job.RecordResendingJob : Unable to execute HTTP request: Remote host terminated the handshake
com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.services.sqs.AmazonSQSClient.doInvoke(AmazonSQSClient.java:1792) ~[aws-java-sdk-sqs-1.11.106.jar:na]
...
Further thread starvation messages might indicate issues with obtaining a JDBC database connection:
Code Block 2021-09-08 02:09:15.190 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m50s844ms18?s352ns). 2021-09-08 02:10:20.030 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=4m44s49ms194?s903ns).
MySQL issues were also noted; simultaneously there There was a large spike in Empty Receives on the TISSS
tis-trainee-sync-queue-prod
queue (TODO check TCS/Reference for similar), which is populated from TIS Self-service and polled by tis-sync.Simultaneously, MySQL connection issues were noted on tis-sync.
TCS showed no logging activity during the period (22:16 the previous night to 06:07 UTC) and unfortunately the Reference service logs were rolled-over before they could be examined and had no content prior to 9 Sep.
As there were also no abnormal errors logged by the MySQL database at that time, it could be that MySQL closed the connection due to a lack of activity, rather than an issue on the database server itself. This lack of activity could be due to polling the emptytis-trainee-sync-queue-prod
queue, as indicated above.2021-09-08 05:08:27.549 WARN 1 --- [onPool-worker-0] com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@7a00d984 marked as broken because of SQLSTATE(08007), ErrorCode(0) java.sql.SQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:110) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:63) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:1848) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.zaxxer.hikari.pool.ProxyConnection.rollback(ProxyConnection.java:396) ~[HikariCP-3.4.5.jar:na] at com.zaxxer.hikari.pool.HikariProxyConnection.rollback(HikariProxyConnection.java) [HikariCP-3.4.5.jar:na] at org.hibernate.resource.jdbc.internal.AbstractLogicalConnectionImplementor.rollback(AbstractLogicalConnectionImplementor.java:121) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.rollback(JdbcResourceLocalTransactionCoordinatorImpl.java:304) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at org.hibernate.engine.transaction.internal.TransactionImpl.rollback(TransactionImpl.java:142) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at uk.nhs.tis.sync.job.TrustAdminSyncJobTemplate.run(TrustAdminSyncJobTemplate.java:113) [classes/:1.13.0] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[na:1.8.0_292] at java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1632) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ~[na:1.8.0_292]
The restart of this job at 07:49 BST triggered further SQS errors of the type shown above, and finally failed due to an OOM error on Prod-green:
2021-09-08 08:12:52.496 INFO 1 --- [onPool-worker-3] uk.nhs.tis.sync.job.RecordResendingJob : Reading [Record Resending job] started 2021-09-08 08:16:28.407 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=46s274ms874?s695ns). 2021-09-08 08:20:26.806 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=5m18s714ms721?s911ns). java.lang.OutOfMemoryError: Java heap space Dumping heap to /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof ... Unable to create /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof: File exists Terminating due to java.lang.OutOfMemoryError: Java heap space Setting Active Processor Count to 4 Adding $JAVA_OPTS to $JAVA_TOOL_OPTIONS Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx2432201K -XX:MaxMetaspaceSize=201526K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 3G, Thread Count: 250, Loaded Class Count: 33166, Headroom: 0%) Adding 129 container CA certificates to JVM truststore Spring Cloud Bindings Enabled
The abnormally high message queue load can be seen in TIS Self Service Records from 00:10 UTC to 16:30 UTC on the 9th Aug:
...