Date	May 19, 2021
Authors	@Reuben Roberts
Status	Done
Summary	https://hee-tis.atlassian.net/browse/TIS21-1593
Impact	Users had an inaccurate list of People on Admins-UI

Non-technical Description

The Person Placement Employing Body Trust Job failed to run successfully.

Trigger

Failure of the Sync service during the job.

Detection

Slack notification in #monitoring
User noticed through the Person Search page that job was being rerun

Resolution

Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.

Timeline

May 19, 2021: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
May 19, 2021: 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
May 19, 2021: 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted
May 19, 2021: 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted
May 19, 2021: 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully
May 19, 2021: 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully
May 19, 2021: 08:28 BST - PersonElasticSearchSyncJob restarted
May 19, 2021: 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.
- The job started as normal: 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
- The last log entry for the job was recorded at 01:02:33 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]
- Errors started appearing from 01:12:00
  - 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request
  - 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond
  - Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19
CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal:
Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.
Unfortunately ancillary logs for the TCS service were not available, since the service had been redeployed (rebuilding the docker container) before these could be inspected.
Further assessment of the root cause is not possible at this time.

Action Items

Action Items	Owner

Action Items	Owner
Monitor sync service for similar failures in future. If there is a reoccurrence, further investigation will be warranted.
Consider improving resource monitoring (memory usage in particular).

Lessons Learned

Failure of sync jobs to complete is only reported some hours later.
Better resource monitoring could assist with RCA in these circumstances.

TIS21 Confluence Space

2021-05-19 Person Placement Employing Body Trust sync job failed affecting Person Search

Analytics

Non-technical Description

Trigger

Detection

Resolution

Timeline

Root Cause(s)

Action Items

Lessons Learned

Related content