2021-05-19 Person Placement Employing Body Trust sync job failed affecting Person Search
Date | May 19, 2021 |
Authors | @Reuben Roberts |
Status | Done |
Summary | The overnight PersonPlacementEmployingBodyTrustJob failed to complete, and the dependent sync jobs had to be rerun the following morning before Person Search data was accurate again. |
Impact | Users had an inaccurate list of People on Admins-UI |
Non-technical Description
The Person Placement Employing Body Trust job failed to run successfully overnight, so the list of People shown in Person Search was out of date until the affected sync jobs were rerun.
Trigger
Failure of the Sync service during the job.
Detection
Slack notification in #monitoring
A user noticed, via the Person Search page, that the job was being rerun.
Resolution
Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.
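For reference, the sketch below shows one way such a rerun could be triggered by hand, assuming the sync service exposes an HTTP endpoint for starting a named job; the host `sync:8080` and the `/api/jobs/{name}/run` path are hypothetical, not the actual TIS API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Hypothetical sketch: re-trigger the three sync jobs.
// The host and the /api/jobs/{name}/run path are assumptions, not the real TIS sync API.
// In the actual incident the jobs were rerun one at a time, waiting for each to complete.
public class RerunSyncJobs {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        List<String> jobs = List.of(
                "PersonPlacementEmployingBodyTrustJob",
                "PersonPlacementTrainingBodyTrustJob",
                "PersonElasticSearchSyncJob");
        for (String job : jobs) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://sync:8080/api/jobs/" + job + "/run"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(job + " -> HTTP " + response.statusCode());
        }
    }
}
```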
Timeline
May 19, 2021: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
May 19, 2021: 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
May 19, 2021: 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted
May 19, 2021: 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted
May 19, 2021: 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully
May 19, 2021: 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully
May 19, 2021: 08:28 BST - PersonElasticSearchSyncJob restarted
May 19, 2021: 08:40 BST - PersonElasticSearchSyncJob completed successfully
Root Cause(s)
The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete. The job started as normal:
2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
The last log entry for the job was recorded at 01:02:33:
2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]
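The log line above reflects the job's keyset-style paging through records, carrying the last person and employing-body IDs forward between batches. A minimal sketch of that pattern is shown below; the repository, entity and batch-size names are illustrative, not the actual TIS implementation.

```java
import java.util.List;

// Illustrative keyset-pagination loop, as suggested by the
// "Querying with lastPersonId: [...] and lastEmployingBodyId: [...]" log message.
// Names and batch size are hypothetical, not the TIS sync code.
public class TrustSyncPager {

    private static final int BATCH_SIZE = 1000;

    interface PlacementRepository {
        // Next batch ordered by (personId, employingBodyId) strictly after the given keys.
        List<PlacementRow> findNextBatch(long lastPersonId, long lastEmployingBodyId, int limit);
    }

    record PlacementRow(long personId, long employingBodyId) {}

    void run(PlacementRepository repository) {
        long lastPersonId = 0;
        long lastEmployingBodyId = 0;
        List<PlacementRow> batch;
        do {
            batch = repository.findNextBatch(lastPersonId, lastEmployingBodyId, BATCH_SIZE);
            for (PlacementRow row : batch) {
                // ... update the employing-body trust association for this placement ...
            }
            if (!batch.isEmpty()) {
                PlacementRow last = batch.get(batch.size() - 1);
                lastPersonId = last.personId();
                lastEmployingBodyId = last.employingBodyId();
            }
        } while (!batch.isEmpty());
    }
}
```

If such a loop stops advancing (for example because a downstream call hangs or the process runs out of memory), the last keys logged indicate roughly how far the job got before failing.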
Errors started appearing from 01:12:00:
2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request
2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond
Various errors indicating service failures, timeouts and out-of-memory conditions continued until 04:44:19.
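The errors above come from the sync service's HTTP calls to the tcs service; the RetryExec log line suggests the underlying client is Apache HttpClient 4.x. As an illustration only (not an agreed fix or the current configuration), explicit timeouts on that client would turn a hung tcs endpoint into a fast, clearly logged failure rather than a long series of retries:

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

// Illustrative sketch: explicit timeouts for an Apache HttpClient 4.x client.
// The values are examples, not recommended production settings.
public class TimeoutConfiguredClient {

    public static CloseableHttpClient build() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5_000)             // ms to establish the TCP connection
                .setSocketTimeout(30_000)             // ms of inactivity while awaiting data
                .setConnectionRequestTimeout(5_000)   // ms to obtain a connection from the pool
                .build();
        return HttpClients.custom()
                .setDefaultRequestConfig(config)
                .build();
    }
}
```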
CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal.
Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.
Unfortunately, ancillary logs for the TCS service were not available, as the service had been redeployed (rebuilding the Docker container) before they could be inspected.
Further assessment of the root cause is not possible at this time.
Action Items
Action Items | Owner
---|---
Monitor the sync service for similar failures in future. If there is a recurrence, further investigation will be warranted. |
Consider improving resource monitoring (memory usage in particular). |
Lessons Learned
Failure of a sync job to complete is only reported some hours later (here, the job stalled shortly after it started overnight, but the failure notification did not arrive until 07:38).
Better resource monitoring could assist with RCA in these circumstances.
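As an illustration of the kind of resource monitoring that could help with RCA, the sketch below periodically logs JVM heap usage from within the service using only standard JDK classes; the one-minute interval and 90% warning threshold are example values, not an agreed design.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: log heap usage periodically so that suspected
// out-of-memory conditions leave a trail in the logs before the JVM dies.
// Interval and threshold are example values only.
public class MemoryUsageLogger {

    public static void start() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            long usedMb = heap.getUsed() / (1024 * 1024);
            long maxMb = heap.getMax() / (1024 * 1024);
            double ratio = (double) heap.getUsed() / heap.getMax();
            String level = ratio > 0.9 ? "WARN" : "INFO";
            System.out.printf("%s heap used: %d MB of %d MB (%.0f%%)%n",
                    level, usedMb, maxMb, ratio * 100);
        }, 0, 1, TimeUnit.MINUTES);
    }
}
```

Exporting the same figures to an external monitoring tool, rather than only to the application log, would also cover the case where the instance itself becomes unresponsive.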
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213