
Date

19 May 2021
Authors

Andy Dingley

Status

Resolved

Summary

The overnight PersonPlacementEmployingBodyTrustJob failed to complete on the production server following a failure of the Sync service. The failure was not notified until 07:38 BST; the affected jobs (PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob) were then rerun and all completed successfully by 08:40 BST.
Impact

Non-technical Description


Trigger

  • Failure of the Sync service during the job.


Detection

  • Slack notification in #monitoring

  • A user noticed via the Person Search page that the job was being rerun



Resolution

  • Manual rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob, in that order.


Timeline

All times are on 19 May 2021.

  • 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on the production server, but does not complete

  • 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed

  • 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted

  • 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted

  • 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully

  • 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully

  • 08:28 BST - PersonElasticSearchSyncJob restarted

  • 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

  • The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.

    • The job started as normal:

      2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

    • The last log entry for the job was recorded at 01:02:33:

      2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]

    • Errors started appearing from 01:12:00

      • 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request

      • 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond

      • Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19

  • CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00, and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers run on that instance in addition to Sync). This was abnormal.

  • Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.

  • Unfortunately, ancillary logs for the TCS service were not available: the service had been redeployed (rebuilding the Docker container and discarding its logs) before they could be inspected. A sketch of snapshotting container logs before a redeploy follows this list.

  • Further assessment of the root cause is not possible at this time.
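The loss of the TCS logs suggests a small process improvement: snapshot a container's logs before any redeploy that rebuilds it. The sketch below is an illustration only, not an existing script; the container name "tcs" is taken from the log excerpt above, while the snapshot directory is an assumption. It shells out to the standard docker logs command.

    #!/usr/bin/env python3
    """Snapshot a container's logs before it is redeployed.

    A minimal sketch: assumes the Docker CLI is on PATH and that the
    container is named "tcs" (as in the log excerpt above); the output
    directory is an assumption for illustration.
    """
    import subprocess
    from datetime import datetime, timezone
    from pathlib import Path

    CONTAINER = "tcs"                            # container name from the incident logs
    OUT_DIR = Path("/var/log/deploy-snapshots")  # assumed snapshot location

    def snapshot_logs(container: str) -> Path:
        OUT_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        out_file = OUT_DIR / f"{container}-{stamp}.log"
        # `docker logs` emits everything the container has written so far;
        # capture stderr too, since application logging may use either stream.
        result = subprocess.run(
            ["docker", "logs", container],
            capture_output=True, text=True, check=True,
        )
        out_file.write_text(result.stdout + result.stderr)
        return out_file

    if __name__ == "__main__":
        print(f"Logs saved to {snapshot_logs(CONTAINER)}")

Running something like this as the first step of the deploy pipeline would have preserved the TCS logs for inspection.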


Action Items

Action Item | Owner
Monitor the sync service for similar failures in future; if there is a recurrence, further investigation will be warranted. |
Consider improving resource monitoring, memory usage in particular (see the sketch below). |
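To make the second action item concrete, the following is a hedged sketch of publishing memory usage as a custom CloudWatch metric. The root-cause notes mention an EC2 instance, so CloudWatch is a plausible target, but that is an assumption, as are the namespace and dimension names; psutil and boto3 would need to be installed with suitable AWS credentials available.

    """Publish the instance's memory usage as a custom CloudWatch metric.

    A sketch under stated assumptions: the namespace, dimension and
    60-second cadence are illustrative only.
    """
    import time

    import boto3
    import psutil

    cloudwatch = boto3.client("cloudwatch")

    def publish_memory_metric() -> None:
        used_percent = psutil.virtual_memory().percent
        cloudwatch.put_metric_data(
            Namespace="Sync/Resources",  # assumed namespace
            MetricData=[{
                "MetricName": "MemoryUsedPercent",
                "Dimensions": [{"Name": "Service", "Value": "sync"}],
                "Unit": "Percent",
                "Value": used_percent,
            }],
        )

    if __name__ == "__main__":
        while True:
            publish_memory_metric()
            time.sleep(60)  # one data point per minute

In practice the CloudWatch agent can collect this metric without custom code; the sketch just shows the shape of the data an alarm would be built on.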


Lessons Learned

  • Failure of sync jobs to complete is only reported some hours later; a watchdog sketch addressing this follows this list.

  • Better resource monitoring could assist with RCA in these circumstances.
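The first lesson points to an obvious mitigation: shortly after the job's expected finish time, check whether a "started" log line has a matching completion, and alert immediately if not. The sketch below is illustrative only, intended to run from cron: the log path, the "completed" marker and the Slack webhook URL are all assumptions; only the "started" line format is taken from the logs quoted above.

    """Alert if a sync job has started but not logged completion.

    A sketch under stated assumptions; run shortly after the job's
    expected finish time.
    """
    import re

    import requests

    LOG_PATH = "/var/log/sync/sync.log"               # assumed log location
    WEBHOOK = "https://hooks.slack.com/services/..."  # hypothetical webhook
    JOB = "PersonPlacementEmployingBodyTrustJob"

    def log_matches(pattern: str) -> bool:
        with open(LOG_PATH) as fh:
            return any(re.search(pattern, line) for line in fh)

    def main() -> None:
        started = log_matches(rf"Sync \[{JOB}\] started")
        completed = log_matches(rf"Sync \[{JOB}\] completed")  # assumed marker
        if started and not completed:
            requests.post(WEBHOOK, json={
                "text": f":warning: {JOB} started but has not completed - "
                        "check the Sync service.",
            })

    if __name__ == "__main__":
        main()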
