2021-05-19 Person Placement Employing Body Trust sync job failed affecting Person Search

Date

May 19, 2021

Authors

@Reuben Roberts

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-1593

Impact

Users had an inaccurate list of People on Admins-UI

Non-technical Description

The Person Placement Employing Body Trust Job failed to run successfully.


Trigger

  • Failure of the Sync service during the job.


Detection

  • Slack notification in #monitoring

     

  • User noticed through the Person Search page that job was being rerun




Resolution

  • Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.


Timeline

  • May 19, 2021: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete

  • May 19, 2021: 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed

  • May 19, 2021: 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted

  • May 19, 2021: 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted

  • May 19, 2021: 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully

  • May 19, 2021: 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully

  • May 19, 2021: 08:28 BST - PersonElasticSearchSyncJob restarted

  • May 19, 2021: 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

  • The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.

    • The job started as normal: 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

    • The last log entry for the job was recorded at 01:02:33 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]

    • Errors started appearing from 01:12:00

      • 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request

      • 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond

      • Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19

  • CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal:

  • Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.

  • Unfortunately ancillary logs for the TCS service were not available, since the service had been redeployed (rebuilding the docker container) before these could be inspected.

  • Further assessment of the root cause is not possible at this time.


Action Items

Action Items

Owner

 

Action Items

Owner

 

Monitor sync service for similar failures in future. If there is a reoccurrence, further investigation will be warranted.

 

Consider improving resource monitoring (memory usage in particular).

 

 

 

 

 

 

 


Lessons Learned

  • Failure of sync jobs to complete is only reported some hours later.

  • Better resource monitoring could assist with RCA in these circumstances.