Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 10 Next »

Date

Authors

Reuben Roberts

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-2094

Impact

Users had an inaccurate list of People on Admins-UI

Non-technical Description

The Person Placement Employing Body Trust Job failed to run successfully.


Trigger

  • Failure of the Sync service during the job.


Detection

  • Slack notification in #monitoring


Resolution

  • Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.


Timeline

  • : 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete

  • : 02:41 BST - Last successful chunk written by the job (31 seconds to complete)

  • : 07:36 BST - Notification that PersonPlacementEmployingBodyTrustJob failed

  • : 07:49 BST - PersonPlacementEmployingBodyTrustJob restarted, but fails silently

  • : 10:00 BST - Stand-up and post-stand-up discussion on way forward

  • : 11:28 BST - PersonPlacementEmployingBodyTrustJob restarted

  • : 11:35 BST - PersonPlacementTrainingBodyTrustJob restarted

  • : 12:04 BST - PersonPlacementEmployingBodyTrustJob completed successfully

  • : 12:13 BST - PersonPlacementTrainingBodyTrustJob completed successfully

  • : 12:21 BST - PersonSynJob started

  • : 12:33 BST - PersonSynJob completed successfully

Root Cause(s)

  • The PersonPlacementEmployingBodyTrustJob started as scheduled (on Prod green), but failed to complete
    2021-09-08 00:09:00.009 INFO 1 --- [onPool-worker-1] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

  • Last successful chunks written at 02:40 BST and 02:41:33 BST. These took approx. 25s and 31s respectively.

  • SQS queueing errors began at 02:52 BST:
    2021-09-08 01:52:24.253 ERROR 1 --- [onPool-worker-2] uk.nhs.tis.sync.job.RecordResendingJob : Unable to execute HTTP request: Remote host terminated the handshake

    com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.1026.jar:na]
    at com.amazonaws.services.sqs.AmazonSQSClient.doInvoke(AmazonSQSClient.java:1792) ~[aws-java-sdk-sqs-1.11.106.jar:na]
    ...

  • Further thread starvation messages might indicate issues with obtaining a JDBC database connection:

    2021-09-08 02:09:15.190  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=1m50s844ms18?s352ns).
    2021-09-08 02:10:20.030  WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool        : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=4m44s49ms194?s903ns).

  • MySQL issues were also noted; simultaneously there was a large spike in Empty Receives on the TISSS tis-trainee-sync-queue-prod queue (TODO check TCS/Reference for similar)
    2021-09-08 05:08:27.549 WARN 1 --- [onPool-worker-0] com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection com.mysql.cj.jdbc.ConnectionImpl@7a00d984 marked as broken because of SQLSTATE(08007), ErrorCode(0) java.sql.SQLNonTransientConnectionException: Communications link failure during rollback(). Transaction resolution unknown. at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:110) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:89) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:63) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.mysql.cj.jdbc.ConnectionImpl.rollback(ConnectionImpl.java:1848) ~[mysql-connector-java-8.0.20.jar:8.0.20] at com.zaxxer.hikari.pool.ProxyConnection.rollback(ProxyConnection.java:396) ~[HikariCP-3.4.5.jar:na] at com.zaxxer.hikari.pool.HikariProxyConnection.rollback(HikariProxyConnection.java) [HikariCP-3.4.5.jar:na] at org.hibernate.resource.jdbc.internal.AbstractLogicalConnectionImplementor.rollback(AbstractLogicalConnectionImplementor.java:121) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at org.hibernate.resource.transaction.backend.jdbc.internal.JdbcResourceLocalTransactionCoordinatorImpl$TransactionDriverControlImpl.rollback(JdbcResourceLocalTransactionCoordinatorImpl.java:304) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at org.hibernate.engine.transaction.internal.TransactionImpl.rollback(TransactionImpl.java:142) [hibernate-core-5.4.17.Final.jar:5.4.17.Final] at uk.nhs.tis.sync.job.TrustAdminSyncJobTemplate.run(TrustAdminSyncJobTemplate.java:113) [classes/:1.13.0] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[na:1.8.0_292] at java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1632) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[na:1.8.0_292] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) ~[na:1.8.0_292]

  • The restart of this job at 07:49 BST triggered further SQS errors of the type shown above, and finally failed due to an OOM error on Prod-green:
    2021-09-08 08:12:52.496 INFO 1 --- [onPool-worker-3] uk.nhs.tis.sync.job.RecordResendingJob : Reading [Record Resending job] started 2021-09-08 08:16:28.407 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=46s274ms874?s695ns). 2021-09-08 08:20:26.806 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=5m18s714ms721?s911ns). java.lang.OutOfMemoryError: Java heap space Dumping heap to /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof ... Unable to create /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof: File exists Terminating due to java.lang.OutOfMemoryError: Java heap space Setting Active Processor Count to 4 Adding $JAVA_OPTS to $JAVA_TOOL_OPTIONS Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx2432201K -XX:MaxMetaspaceSize=201526K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 3G, Thread Count: 250, Loaded Class Count: 33166, Headroom: 0%) Adding 129 container CA certificates to JVM truststore Spring Cloud Bindings Enabled

  • The abnormally high message queue load can be seen in TIS Self Service Records from 00:10 UTC to 16:30 UTC on the 9th Aug:

Some technical information available @

https://hee-nhs-tis.slack.com/archives/GAGAPL1MZ/p1631533304005300?thread_ts=1631531639.004400&cid=GAGAPL1MZ

Action Items

Action Items

Owner

Why so many placementSpecialties are not found in TISSS sync, triggering another call to TIS sync? Are these for deleted records, if they are in turn not found by TIS sync, or is there a bug / other issue here?

[Found that: “spot checked some of the PlacementSpecialty messages in the DLQ, and found that the their placementIds are for Placements that have been recently deleted (found them in the ESR Audit queue tis.placement.deleted If a Placement gets deleted, as far as I know, the PlacementSpecialty gets deleted as well]

Marcello Fabbri (Unlicensed)

Done

Investigate adjusting the cron record-resending job for TISSS sync: either stop it running during the nightly sync job timeframe, run it every 10min instead of every 1min, or determine how to disable it while another job is running

TIS21-2107

Rerun the sync jobs one by one, confirm success at each step

Joseph (Pepe) Kelly

Done

Write tickets to handle record deletions (currently marked with a //TODO)


Lessons Learned

  • Group review is very useful!

  • No labels