Date | |
Authors | |
Status | Documenting |
Summary | |
Impact | Users had an inaccurate list of People on Admins-UI |
Non-technical Description
The Person Placement Employing Body Trust Job failed to run successfully.
Trigger
Failure of the Sync service during the job.
Detection
Slack notification in #monitoring
Resolution
Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.
Timeline
: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
: 07:36 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
: 07:49 BST - PersonPlacementEmployingBodyTrustJob restarted, but fails silently
: 10:00 BST - Stand-up and post-stand-up discussion on way forward
: 11:28 BST - PersonPlacementEmployingBodyTrustJob restarted
: 11:35 BST - PersonPlacementTrainingBodyTrustJob restarted
: 12:04 BST - PersonPlacementEmployingBodyTrustJob completed successfully
: 12:13 BST - PersonPlacementTrainingBodyTrustJob completed successfully
: 12:21 BST - PersonSynJob started
: 12:33 BST - PersonSynJob completed successfully
Root Cause(s)
The PersonPlacementEmployingBodyTrustJob started as scheduled (on Prod green), but failed to complete:
2021-09-08 00:09:00.009 INFO 1 --- [onPool-worker-1] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
SQS queueing(?) errors began at 02:52 BST (the log timestamps appear to be in UTC, one hour behind BST):
2021-09-08 01:52:24.253 ERROR 1 --- [onPool-worker-2] uk.nhs.tis.sync.job.RecordResendingJob : Unable to execute HTTP request: Remote host terminated the handshake
com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1207) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1153) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.1026.jar:na]
at com.amazonaws.services.sqs.AmazonSQSClient.doInvoke(AmazonSQSClient.java:1792) ~[aws-java-sdk-sqs-1.11.106.jar:na]
...
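When correlating log lines with the timeline, it can help to convert the log timestamps to UK local time. The following is a minimal sketch, assuming the logs are written in UTC (which the 00:09 log entry for the 01:09 BST job start suggests). The class and method names are hypothetical and not part of the sync service.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: converts a log timestamp (assumed UTC) to UK local time. */
final class LogTimeConverter {

  // Matches the timestamp prefix of the log lines, e.g. "2021-09-08 01:52:24.253".
  private static final DateTimeFormatter LOG_FORMAT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

  private static final ZoneId UK = ZoneId.of("Europe/London");

  static ZonedDateTime toUkTime(String logTimestamp) {
    return LocalDateTime.parse(logTimestamp, LOG_FORMAT)
        .atOffset(ZoneOffset.UTC)   // assumption: logs are in UTC
        .atZoneSameInstant(UK);     // BST in summer, GMT in winter
  }

  public static void main(String[] args) {
    // First SQS error in the log above.
    System.out.println(toUkTime("2021-09-08 01:52:24.253").toLocalTime()); // prints 02:52:24.253
  }
}
```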
The restart of this job at 07:49 BST triggered further SQS errors of the type shown above, and the job finally failed with an OOM error on Prod green:
2021-09-08 08:12:52.496 INFO 1 --- [onPool-worker-3] uk.nhs.tis.sync.job.RecordResendingJob : Reading [Record Resending job] started
2021-09-08 08:16:28.407 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=46s274ms874µs695ns).
2021-09-08 08:20:26.806 WARN 1 --- [l-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : HikariPool-1 - Thread starvation or clock leap detected (housekeeper delta=5m18s714ms721µs911ns).
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof ...
Unable to create /var/log/apps/hprof/sync-2021-07-07-12:38:37.hprof: File exists
Terminating due to java.lang.OutOfMemoryError: Java heap space
Setting Active Processor Count to 4
Adding $JAVA_OPTS to $JAVA_TOOL_OPTIONS
Calculated JVM Memory Configuration: -XX:MaxDirectMemorySize=10M -Xmx2432201K -XX:MaxMetaspaceSize=201526K -XX:ReservedCodeCacheSize=240M -Xss1M (Total Memory: 3G, Thread Count: 250, Loaded Class Count: 33166, Headroom: 0%)
Adding 129 container CA certificates to JVM truststore
Spring Cloud Bindings Enabled
Action Items
Action Items | Owner | |
---|---|---|
Investigate why so many placementSpecialties are not found in TISSS sync, triggering another call to TIS sync. Are these deleted records (if TIS sync does not find them either), or is there a bug or other issue? | Done | |
Investigate adjusting the cron schedule of the record-resending job for TISSS sync: either stop it running during the nightly sync job timeframe, run it every 10 min instead of every 1 min, or determine how to disable it while another job is running | | |
Rerun the sync jobs one by one, confirm success at each step | Done | |
Write tickets to handle record deletions (currently marked with a //TODO) | | |
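One of the options in the action items above, disabling the record-resending job while another job is running, could be sketched roughly as follows. This is an illustration only: `JobRunGuard` and its methods are hypothetical names, not existing classes in the sync service, and the sketch assumes all jobs run inside a single JVM. It would not coordinate jobs across separate instances.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical guard: lets at most one job run at a time within a single JVM. */
final class JobRunGuard {

  // Name of the job currently running, or null when idle.
  private final AtomicReference<String> running = new AtomicReference<>();

  /**
   * Runs the job only if no other job currently holds the guard.
   *
   * @return true if the job ran, false if it was skipped
   */
  boolean tryRun(String jobName, Runnable job) {
    if (!running.compareAndSet(null, jobName)) {
      return false; // another job is in progress: skip this scheduled tick
    }
    try {
      job.run();
      return true;
    } finally {
      running.set(null); // release the guard even if the job throws
    }
  }

  /** Name of the job holding the guard, if any (useful for log messages). */
  Optional<String> currentJob() {
    return Optional.ofNullable(running.get());
  }

  public static void main(String[] args) {
    JobRunGuard guard = new JobRunGuard();
    guard.tryRun("PersonPlacementEmployingBodyTrustJob", () -> {
      // A record-resending tick that fires while the sync job is running is skipped.
      boolean resent = guard.tryRun("RecordResendingJob", () -> {});
      System.out.println("Resend ran during sync: " + resent); // prints false
    });
    boolean resentAfter = guard.tryRun("RecordResendingJob", () -> {});
    System.out.println("Resend ran after sync: " + resentAfter); // prints true
  }
}
```

A skipped tick is simply dropped rather than queued, so the resending job would pick up its work on the next scheduled run after the long-running job completes.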
Lessons Learned
TODO