Date

Authors

Reuben Roberts

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-1593

Impact

Users had an inaccurate list of People on Admins-UI

...

The Person Placement Employing Body Trust Job failed to run successfully.

...

Trigger

  • Failure of the Sync service during the job.

...

Detection

  • Slack notification in #monitoring

  • A user noticed on the Person Search page that the job was being rerun


...

Resolution

  • Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.

...

Timeline

  • 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete

  • 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed

  • 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted

  • 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted

  • 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully

  • 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully

  • 08:28 BST - PersonElasticSearchSyncJob restarted

  • 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

...

  • The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.

    • The job started as normal: 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

    • The last log entry for the job was recorded at 01:02:33: 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]

    • Errors started appearing from 01:12:00

      • 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request

      • 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond

      • Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19

  • CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal:

    [Image: CPU usage graph for the Sync EC2 instance]

  • Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.

  • Unfortunately, ancillary logs for the TCS service were not available, because the service had been redeployed (rebuilding the Docker container) before they could be inspected.

  • Further assessment of the root cause is not possible at this time.
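
The log analysis above (finding the job's last log entry and the first error that followed) can be sketched as a small script. This is only an illustrative sketch: the regular expression is inferred from the three log lines quoted in this report, and the names `parse_line`, `summarise` and `SAMPLE_LINES` are hypothetical, not part of the Sync service.

```python
import re
from datetime import datetime

# Pattern inferred from the log lines quoted in this report (an assumption):
# "<date> <time> <LEVEL> <pid> --- [<thread>] <logger> : <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+\[(?P<thread>[^\]]*)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<message>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return entry


def summarise(lines, job_name):
    """Return (last entry mentioning job_name, first ERROR entry)."""
    last_job_entry = None
    first_error = None
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip stack traces and other non-matching lines
        if job_name in entry["logger"] or job_name in entry["message"]:
            last_job_entry = entry
        if entry["level"] == "ERROR" and first_error is None:
            first_error = entry
    return last_job_entry, first_error


# The three log lines quoted in the Root Cause(s) section.
SAMPLE_LINES = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] "
    "u.n.t.s.job.TrustAdminSyncJobTemplate : "
    "Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] "
    "s.j.PersonPlacementEmployingBodyTrustJob : "
    "Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] "
    "u.n.tis.sync.service.DataRequestService : "
    "RESTEASY004655: Unable to invoke request",
]

if __name__ == "__main__":
    last, error = summarise(SAMPLE_LINES, "PersonPlacementEmployingBodyTrustJob")
    print("Last job entry:", last["ts"], last["message"])
    print("First error:   ", error["ts"], error["message"])
    print("Gap:           ", error["ts"] - last["ts"])
```

Run against a full log file, the same functions would give the interval between the job's last sign of progress and the first error, which may help narrow down when the failure began.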

...

Action Items

Action Items

Owner

Monitor the sync service for similar failures in future. If the failure recurs, further investigation will be warranted.

Consider improving resource monitoring (memory usage in particular).
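
As one possible starting point for the memory monitoring suggested above, the sketch below computes memory usage from the text of Linux's `/proc/meminfo` and returns a warning once a threshold is crossed. It is a minimal illustration under stated assumptions: the 90% threshold and the function names are hypothetical, and it measures the host as a whole rather than an individual container.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(text):
    """Percentage of memory in use, based on MemTotal and MemAvailable."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def memory_alert(text, threshold_percent=90.0):
    """Return a warning string if usage is at or above the threshold, else None.

    The 90% default is an illustrative assumption, not a recommended value.
    """
    used = memory_used_percent(text)
    if used >= threshold_percent:
        return f"WARNING: memory usage at {used:.1f}% (threshold {threshold_percent:.1f}%)"
    return None


if __name__ == "__main__":
    # On a Linux host this reads the live values.
    with open("/proc/meminfo") as handle:
        message = memory_alert(handle.read())
    print(message or "Memory usage below threshold")
```

A check like this could be run on a schedule and its output forwarded to the existing alerting channel, so that memory pressure is recorded before a service is redeployed and its logs are lost.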

...

Lessons Learned

  • Failure of sync jobs to complete is only reported some hours later (in this incident, the notification arrived at 07:38 BST for a job that started at 01:09 BST).

  • Better resource monitoring could assist with RCA in these circumstances.
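
To illustrate how a stalled job might be reported sooner, the sketch below flags any job that has not completed and has shown no progress for longer than a configurable period. It is a hedged sketch only: `JobStatus`, `is_stalled`, `stalled_jobs` and the 15-minute default are hypothetical and do not describe the existing Sync service or its notification mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class JobStatus:
    """Minimal view of a job's progress (hypothetical structure)."""
    name: str
    started_at: datetime
    last_progress_at: datetime
    completed_at: Optional[datetime] = None


def is_stalled(job, now, max_silence=timedelta(minutes=15)):
    """True if the job is unfinished and has been silent for too long."""
    if job.completed_at is not None:
        return False
    return now - job.last_progress_at > max_silence


def stalled_jobs(jobs, now, max_silence=timedelta(minutes=15)):
    """Names of all jobs currently considered stalled."""
    return [job.name for job in jobs if is_stalled(job, now, max_silence)]


if __name__ == "__main__":
    # Timestamps taken from the log lines quoted in this report.
    job = JobStatus(
        name="PersonPlacementEmployingBodyTrustJob",
        started_at=datetime(2021, 5, 19, 0, 9, 0),
        last_progress_at=datetime(2021, 5, 19, 1, 2, 33),
    )
    print(stalled_jobs([job], now=datetime(2021, 5, 19, 1, 30, 0)))
```

With the timestamps from this incident and the assumed 15-minute limit, the job would have been flagged within about a quarter of an hour of its last log entry rather than several hours later. A suitable limit would need to be chosen from the normal interval between progress messages.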