Date

Authors

Reuben Roberts

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-1593

Impact

Users had an inaccurate list of People on Admins-UI

...

  • The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.

    • The job started as normal: 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

    • The job's last log entry (01:02:33) was: 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]

    • Errors began appearing at 01:12:00, including:

      • 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request

      • 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond

      • Further errors indicating service failures, timeouts and out-of-memory conditions continued until 04:44:19.

  • CPU usage on the Sync EC2 instance rose abruptly to over 50% at approx. 01:00 and stayed at 100% from approx. 01:50 to 05:30 (although other containers besides Sync also run on that instance). This was abnormal:

    [Image: CPU usage of the Sync EC2 instance]

  • Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.

  • Unfortunately, ancillary logs for the TCS service were unavailable because the service had been redeployed (rebuilding its Docker container) before they could be inspected.

  • Further assessment of the root cause is not possible at this time.
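For future investigations, the quoted log lines can be parsed to measure how long the job was silent before errors appeared. This is a minimal sketch that assumes the Spring Boot default console layout seen in those lines. It uses only the four lines quoted in this report, not the full container log. The names `LOG_LINES`, `LINE_RE`, `JOB_LOGGER_SUFFIX` and `parse` are hypothetical and do not come from the Sync codebase.

```python
import re
from datetime import datetime

# Illustrative only: the four log lines quoted in this report.
# A real analysis would read the full container log instead.
LOG_LINES = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request",
    "2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond",
]

# Assumed layout: <date> <time> <LEVEL> <PID> --- [<thread>] <logger> : <message>
# \s+ tolerates the column padding Spring Boot normally adds.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<msg>.*)$"
)

# Hypothetical filter: treat entries from the job's own logger as "job" entries.
JOB_LOGGER_SUFFIX = "PersonPlacementEmployingBodyTrustJob"


def parse(line):
    """Split one log line into its fields, or return None if it doesn't match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return fields


entries = [e for e in map(parse, LOG_LINES) if e is not None]

# Latest entry written by the job's logger, and earliest ERROR from any logger.
last_job_entry = max(
    (e for e in entries if e["logger"].endswith(JOB_LOGGER_SUFFIX)),
    key=lambda e: e["ts"],
)
first_error = min(
    (e for e in entries if e["level"] == "ERROR"),
    key=lambda e: e["ts"],
)

gap = first_error["ts"] - last_job_entry["ts"]
print(f"Last job entry: {last_job_entry['ts']}")
print(f"First error:    {first_error['ts']}")
print(f"Silent gap:     {gap}")
```

Matching on the logger name rather than the message text keeps the job's own entries separate from the template's "started" line, which is logged by `TrustAdminSyncJobTemplate`.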

...