
Date

Authors

Reuben Roberts, Joseph (Pepe) Kelly, Marcello Fabbri (Unlicensed)

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-2094

Impact

Users had an inaccurate list of People on Admins-UI

Non-technical Description

The Person Placement Employing Body Trust Job failed to run successfully.


Trigger

  • Failure of the Sync service during the job.


Detection

  • Slack notification in #monitoring


Resolution

  • Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.


Timeline

  • 01:09 BST - PersonPlacementEmployingBodyTrustJob started on the production server, but did not complete

  • 02:41 BST - Last successful chunk written by the job (31 seconds to complete)

  • 07:36 BST - Notification that PersonPlacementEmployingBodyTrustJob failed

  • 07:49 BST - PersonPlacementEmployingBodyTrustJob restarted, but failed silently

  • 10:00 BST - Stand-up and post-stand-up discussion on the way forward

  • 11:28 BST - PersonPlacementEmployingBodyTrustJob restarted

  • 11:35 BST - PersonPlacementTrainingBodyTrustJob restarted

  • 12:04 BST - PersonPlacementEmployingBodyTrustJob completed successfully

  • 12:13 BST - PersonPlacementTrainingBodyTrustJob completed successfully

  • 12:21 BST - PersonElasticSearchSyncJob started

  • 12:33 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

  • The PersonPlacementEmployingBodyTrustJob started as scheduled (on Prod green), but failed to complete
    2021-09-08 00:09:00.009 INFO 1 --- [onPool-worker-1] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

  • The last successful chunks for this job were written at 02:40 BST and 02:41:33 BST. These took approx. 25s and 31s respectively.

  • SQS queueing errors began at 02:52 BST:
    2021-09-08 01:52:24.253 ERROR 1 --- [onPool-worker-2] uk.nhs.tis.sync.job.RecordResendingJob : Unable to execute HTTP request: Remote host terminated the handshake

    com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.services.sqs.AmazonSQSClient.doInvoke(AmazonSQSClient.java:1792) ~[aws-java-sdk-sqs-1.11.106.jar:na]
    ...
    These (or similar 'com.amazonaws.SdkClientException: Unable to execute HTTP request: Couldn't kickstart handshaking') errors recurred at 03:34, 03:45, 03:58, 04:14, 04:23, 04:37 and 04:50 UTC.
    These may reflect throttling or connection limiting by AWS for the SQS service, or a more general networking issue, as database connectivity also appeared poor (see below).

  • Further thread starvation messages might indicate issues with obtaining a JDBC database connection:

    2021-09-08 02:09:15.190  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m50s844ms18µs352ns).
    2021-09-08 02:10:20.030  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=4m44s49ms194µs903ns).

  • There was a large spike in Empty Receives on the tis-trainee-sync-queue-prod queue, which is populated from TIS Self-service and polled by tis-sync.

  • Simultaneously, MySQL connection issues were noted on tis-sync.
    TCS showed no logging activity during the period (22:16 the previous night to 06:07 UTC), and unfortunately the Reference service logs were rolled over before they could be examined and had no content prior to 9 Sep.
    As there were also no abnormal errors logged by the MySQL database at that time, it may be that MySQL closed the connection to tis-sync due to a lack of activity, rather than because of an issue on the database server itself. That lack of activity could in turn be due to polling the empty tis-trainee-sync-queue-prod queue, as indicated above. However, the tis-trainee-sync logs showed nothing obviously abnormal around that time that would suggest why no messages were being published to that queue:
    https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/awslogs-prod-tis-trainee-sync/log-events/awslogs-tis-trainee-sync$252Ftis-trainee-sync$252F2bba9d4220b44f4d9c9ee641aa6bbcf6$3Fstart$3D1631164100000
    https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/awslogs-prod-tis-trainee-sync/log-events/awslogs-tis-trainee-sync$252Ftis-trainee-sync$252F4353fa77de8044809708a72e3ebba5c4$3Fstart$3D1631077705000
    https://eu-west-2.console.aws.amazon.com/cloudwatch/home?region=eu-west-2#logsV2:log-groups/log-group/awslogs-prod-tis-trainee-sync/log-events/awslogs-tis-trainee-sync$252Ftis-trainee-sync$252F1b46a77475e84bc4b6f8de98068fe372$3Fstart$3D1631164105000
    2021-09-08 05:08:27.549 WARN 1 --- [onPool-worker-0] com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@7a00d984 marked as broken because of SQLSTATE(08007), ErrorCode(0)
    java.sql.SQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown.
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:110) ~[mysql-connector-java-8.0.20.jar:8.0.20]
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97) ~[mysql-connector-java-8.0.20.jar:8.0.20]
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89) ~[mysql-connector-java-8.0.20.jar:8.0.20]
    at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:63) ~[mysql-connector-java-8.0.20.jar:8.0.20]
    at com.mysql.cj.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:1848) ~[mysql-connector-java-8.0.20.jar:8.0.20]
    at com.zaxxer.hikari.pool.ProxyConnection.rollback(ProxyConnection.java:396) ~[HikariCP-3.4.5.jar:na]
    at com.zaxxer.hikari.pool.HikariProxyConnection.rollback(HikariProxyConnection.java) [HikariCP-3.4.5.jar:na]
    at org.hibernate.resource.jdbc.internal.AbstractLogicalConnectionImplementor.rollback(AbstractLogicalConnectionImplementor.java:121) [hibernate-core-5.4.17.Final.jar:5.4.17.Final]
    at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.rollback(JdbcResourceLocalTransactionCoordinatorImpl.java:304) [hibernate-core-5.4.17.Final.jar:5.4.17.Final]
    at org.hibernate.engine.transaction.internal.TransactionImpl.rollback(TransactionImpl.java:142) [hibernate-core-5.4.17.Final.jar:5.4.17.Final]
    at uk.nhs.tis.sync.job.TrustAdminSyncJobTemplate.run(TrustAdminSyncJobTemplate.java:113) [classes/:1.13.0]
    at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[na:1.8.0_292]
    at java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1632) ~[na:1.8.0_292]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_292]
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[na:1.8.0_292]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[na:1.8.0_292]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ~[na:1.8.0_292]
    A configuration sketch illustrating one possible mitigation for this broken-connection behaviour is given at the end of this section.

  • The restart of this job at 07:49 BST triggered further SQS errors of the type shown above, and the job finally failed due to an OOM error on Prod-green:
    2021-09-08 08:12:52.496 INFO 1 --- [onPool-worker-3] uk.nhs.tis.sync.job.RecordResendingJob : Reading [Record Resending job] started
    2021-09-08 08:16:28.407 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=46s274ms874µs695ns).
    2021-09-08 08:20:26.806 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=5m18s714ms721µs911ns).
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof ...
    Unable to create /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof: File exists
    Terminating due to java.lang.OutOfMemoryError: Java heap space
    Setting Active Processor Count to 4
    Adding $JAVA_OPTS to $JAVA_TOOL_OPTIONS
    Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx2432201K -XX:MaxMetaspaceSize=201526K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 3G, Thread Count: 250, Loaded Class Count: 33166, Headroom: 0%)
    Adding 129 container CA certificates to JVM truststore
    Spring Cloud Bindings Enabled

  • The abnormally high message queue load can be seen in TIS Self Service Records from 00:10 UTC to 16:30 UTC on 8 Sep.

Some further technical information is available at:

https://hee-nhs-tis.slack.com/archives/GAGAPL1MZ/p1631533304005300?thread_ts=1631531639.004400&cid=GAGAPL1MZ
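
One possible mitigation for the broken MySQL connections described above is to configure the HikariCP pool so that connections are retired before the database can close them server-side for inactivity. The snippet below is a minimal sketch only, assuming the HikariCP 3.x dependency already present in tis-sync; the class name and every timeout value are illustrative assumptions, not the current tis-sync configuration:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    /**
     * Illustrative only: a pool whose connections are retired well before MySQL's
     * wait_timeout, so an overnight job never checks out a connection the server
     * has already closed. Values are assumptions, not the tis-sync defaults.
     */
    public class SyncDataSourceFactory {

      public static HikariDataSource create(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(user);
        config.setPassword(password);

        // Retire connections after 10 minutes, comfortably under a typical MySQL wait_timeout.
        config.setMaxLifetime(600_000);
        // Release connections idle for more than 2 minutes instead of holding them open all night.
        config.setIdleTimeout(120_000);
        config.setMinimumIdle(2);
        config.setMaximumPoolSize(10);
        // Fail a connection request after 30s rather than blocking a job thread indefinitely.
        config.setConnectionTimeout(30_000);

        return new HikariDataSource(config);
      }
    }

In a Spring Boot service the same values could equally be supplied via spring.datasource.hikari.* properties; the key point is keeping max-lifetime comfortably below the MySQL wait_timeout.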

Action Items

  • Why are so many placementSpecialties not found in TISSS sync, triggering another call to TIS sync? Are these deleted records (which would in turn not be found by TIS sync), or is there a bug or other issue here?
    [Found that: "spot checked some of the PlacementSpecialty messages in the DLQ, and found that their placementIds are for Placements that have been recently deleted (found them in the ESR Audit queue tis.placement.deleted). If a Placement gets deleted, as far as I know, the PlacementSpecialty gets deleted as well."]
    Owner: Marcello Fabbri (Unlicensed). Status: Done

  • Investigate adjusting the cron record-resending job for TISSS sync: either stop it running during the nightly sync job timeframe, run it every 10 min instead of every 1 min, or determine how to disable it while another job is running.
    Status: TIS21-2107

  • Rerun the sync jobs one by one, confirming success at each step.
    Owner: Joseph (Pepe) Kelly. Status: Done

  • Write tickets to handle record deletions (currently marked with a //TODO).

  • Contact AWS to determine if there was an issue with SQS.
    Feedback was that there were no particular issues with SQS around the time of the incident, but an open issue on the aws-java-sdk was referenced (https://github.com/aws/aws-sdk-java/issues/2269).
    Owner: Reuben Roberts. Status: Done (Case ID 8876575931)

  • Add a ticket to investigate better error handling (e.g. backoff and retry) on SQS receiveMessage, as per the GitHub issue referenced above (see the sketch after this list).
    Owner: Reuben Roberts. Status: Done (https://hee-tis.atlassian.net/browse/TIS21-2110)
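
The last action item above refers to adding backoff and retry around SQS receiveMessage. A minimal sketch of that kind of handling, assuming the AWS SDK for Java v1 (aws-java-sdk-sqs) already used by tis-sync, follows; the class name, retry limit, backoff delays and long-polling wait time are illustrative assumptions rather than whatever TIS21-2110 eventually implements:

    import com.amazonaws.SdkClientException;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
    import java.util.Collections;
    import java.util.List;

    /**
     * Illustrative only: polls SQS with long polling and retries transient
     * client-side failures with exponential backoff instead of letting a single
     * handshake failure abort the whole sync job.
     */
    public class ResilientSqsPoller {

      private static final int MAX_ATTEMPTS = 5;

      private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

      public List<Message> receiveWithBackoff(String queueUrl) throws InterruptedException {
        // Long polling (up to 20s) also cuts down on Empty Receives against a quiet queue.
        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
            .withMaxNumberOfMessages(10)
            .withWaitTimeSeconds(20);

        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
          try {
            return sqs.receiveMessage(request).getMessages();
          } catch (SdkClientException e) {
            if (attempt == MAX_ATTEMPTS) {
              throw e; // give up after bounded retries so the failure remains visible
            }
            // Back off 1s, 2s, 4s, 8s between attempts.
            Thread.sleep(1000L << (attempt - 1));
          }
        }
        return Collections.emptyList();
      }
    }

Long polling would also reduce the Empty Receives spike noted under Root Cause(s); the essential point is to bound the retries and back off between attempts rather than letting one transient handshake failure kill the job.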


Lessons Learned

  • Group review for RCA and identifying alternative scenarios is very useful!
