Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • The PersonPlacementEmployingBodyTrustJob started as scheduled (on Prod green), but failed to complete
    2021-09-08 00:09:00.009 INFO 1 --- [onPool-worker-1] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

  • The last successful chunks for this job were written at 02:40 BST and 02:41:33 BST. These took approx. 25s and 31s respectively.

  • SQS queueing errors began at 02:52 BST:
    2021-09-08 01:52:24.253 ERROR 1 --- [onPool-worker-2] uk.nhs.tis.sync.job.RecordResendingJob : Unable to execute HTTP request: Remote host terminated the handshake

    com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.services.sqs.AmazonSQSClient.doInvoke(AmazonSQSClient.java:1792) ~[aws-java-sdk-sqs-1.11.106.jar:na]
    ...

  • Further thread starvation messages might indicate issues with obtaining a JDBC database connection:

    Code Block
    2021-09-08 02:09:15.190  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m50s844ms18?s352ns).
    2021-09-08 02:10:20.030  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=4m44s49ms194?s903ns).

  • There was a large spike in Empty Receives on the tis-trainee-sync-queue-prod queue, which is populated from TIS Self-service and polled by tis-sync.

  • Simultaneously, MySQL connection issues were noted on tis-sync.
    TCS showed no logging activity during the period (22:16 the previous night to 06:07 UTC) and unfortunately the Reference service logs were rolled-over before they could be examined and had no content prior to 9 Sep.
    As there were also no abnormal errors logged by the MySQL database at that time, it could be that MySQL closed the connection to tis-sync due to a lack of activity, rather than an issue on the database server itself. This lack of activity could be due to polling the empty tis-trainee-sync-queue-prod queue, as indicated above, though the tis-trainee-sync logs showed nothing obviously abnormal around that time (https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/awslogs-prod-tis-trainee-sync/log-events/awslogs-tis-trainee-sync$252Ftis-trainee-sync$252F2bba9d4220b44f4d9c9ee641aa6bbcf6$3Fstart$3D1631164100000, https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/awslogs-prod-tis-trainee-sync/log-events/awslogs-tis-trainee-sync$252Ftis-trainee-sync$252F4353fa77de8044809708a72e3ebba5c4$3Fstart$3D1631077705000, https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/awslogs-prod-tis-trainee-sync/log-events/awslogs-tis-trainee-sync$252Ftis-trainee-sync$252F1b46a77475e84bc4b6f8de98068fe372$3Fstart$3D1631164105000).
    2021-09-08 05:08:27.549 WARN 1 --- [onPool-worker-0] com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@7a00d984 marked as broken because of SQLSTATE(08007), ErrorCode(0) java.sql.SQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:110) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:63) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:1848) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.zaxxer.hikari.pool.ProxyConnection.rollback(ProxyConnection.java:396) ~[HikariCP-3.4.5.jar:na] at com.zaxxer.hikari.pool.HikariProxyConnection.rollback(HikariProxyConnection.java) [HikariCP-3.4.5.jar:na] at org.hibernate.resource.jdbc.internal.AbstractLogicalConnectionImplementor.rollback(AbstractLogicalConnectionImplementor.java:121) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.rollback(JdbcResourceLocalTransactionCoordinatorImpl.java:304) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at org.hibernate.engine.transaction.internal.TransactionImpl.rollback(TransactionImpl.java:142) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at uk.nhs.tis.sync.job.TrustAdminSyncJobTemplate.run(TrustAdminSyncJobTemplate.java:113) [classes/:1.13.0] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[na:1.8.0_292] at java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1632) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ~[na:1.8.0_292]

  • The restart of this job at 07:49 BST triggered further SQS errors of the type shown above, and finally failed due to an OOM error on Prod-green:
    2021-09-08 08:12:52.496 INFO 1 --- [onPool-worker-3] uk.nhs.tis.sync.job.RecordResendingJob : Reading [Record Resending job] started 2021-09-08 08:16:28.407 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=46s274ms874?s695ns). 2021-09-08 08:20:26.806 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=5m18s714ms721?s911ns). java.lang.OutOfMemoryError: Java heap space Dumping heap to /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof ... Unable to create /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof: File exists Terminating due to java.lang.OutOfMemoryError: Java heap space Setting Active Processor Count to 4 Adding $JAVA_OPTS to $JAVA_TOOL_OPTIONS Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx2432201K -XX:MaxMetaspaceSize=201526K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 3G, Thread Count: 250, Loaded Class Count: 33166, Headroom: 0%) Adding 129 container CA certificates to JVM truststore Spring Cloud Bindings Enabled

  • The abnormally high message queue load can be seen in TIS Self Service Records from 00:10 UTC to 16:30 UTC on the 9th Aug:

...