Date | |
Authors | |
Status | Resolved |
Summary | |
Impact | |
Non-technical Description
Trigger
Failure of the Sync service while PersonPlacementEmployingBodyTrustJob was running.
Detection
Slack notification in #monitoring
A user noticed on the Person Search page that the job was being rerun
Resolution
Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.
Timeline
01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
07:43 BST - PersonPlacementEmployingBodyTrustJob restarted
07:55 BST - PersonPlacementTrainingBodyTrustJob restarted
08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully
08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully
08:28 BST - PersonElasticSearchSyncJob restarted
08:40 BST - PersonElasticSearchSyncJob completed successfully
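The two intervals that matter most in this timeline are the delay before the failure was reported and the time taken to recover. A minimal sketch of that arithmetic follows, using only the times of day listed above; the helper name `minutes_between` is illustrative and assumes both times fall on the same day.

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Job start -> failure notification
detection_lag = minutes_between("01:09", "07:38")
# First restart -> last rerun job completed
recovery_time = minutes_between("07:43", "08:40")

print(f"Detection lag: {detection_lag // 60}h {detection_lag % 60}m")
print(f"Recovery time: {recovery_time} minutes")
```

On these figures the failure went unreported for 6h 29m after the job started, while recovery took 57 minutes once the first rerun began.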
Root Cause(s)
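The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete. The start was logged as normal (the log timestamps below are one hour behind the BST times in the timeline, which suggests they are in UTC):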
2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
The last log entry for the job was recorded at 01:02:33:
2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]
Errors started appearing from 01:12:00:
2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request
2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond
Various errors indicating service failures, timeouts and out-of-memory conditions continued until 04:44:19.
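The gap between the job's last log entry and the first error can be measured directly from the log. The following is a rough sketch only: the regular expression is inferred from the four lines quoted above rather than from the service's logging configuration, and `parse_line` and `silence_before_first_error` are hypothetical helper names.

```python
import re
from datetime import datetime

# Pattern inferred from the quoted lines:
# timestamp, level, pid, '---', [thread], logger, ':', message
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<message>.*)$"
)

def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return entry

def silence_before_first_error(lines, job_name):
    """Time between the job's last log entry and the first ERROR after it."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    job_entries = [
        e for e in entries
        if job_name in e["logger"] or job_name in e["message"]
    ]
    if not job_entries:
        return None
    last_job = max(job_entries, key=lambda e: e["ts"])
    errors = [
        e for e in entries
        if e["level"] == "ERROR" and e["ts"] > last_job["ts"]
    ]
    if not errors:
        return None
    return min(errors, key=lambda e: e["ts"])["ts"] - last_job["ts"]

# The log lines quoted in this section.
lines = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request",
    "2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond",
]

gap = silence_before_first_error(lines, "PersonPlacementEmployingBodyTrustJob")
print(f"Silence before first error: {gap}")
```

For the quoted lines this gives a silence of roughly nine and a half minutes between the job's last entry and the first error.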
CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and reached 100% for the period approx. 01:50 - 05:30, which was abnormal (though it should be noted that other containers are running on that instance in addition to Sync).
Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.
Unfortunately, ancillary logs for the TCS service were not available, since the service had been redeployed (rebuilding the Docker container) before they could be inspected.
Further assessment of the root cause is not possible at this time.
Action Items
Action Items | Owner | |
---|---|---|
Monitor sync service for similar failures in future. If there is a reoccurrence, further investigation will be warranted. | | |
Consider improving resource monitoring (memory usage in particular). | | |
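As one possible starting point for the memory monitoring action, the sketch below computes a used-memory percentage from the contents of `/proc/meminfo` on a Linux host. It is an assumption that host-level figures are the right thing to sample here; container limits or JVM heap usage may be more relevant for Sync. The function names and the sample values are illustrative only.

```python
def memory_used_percent(meminfo_text: str) -> float:
    """Compute the used-memory percentage from /proc/meminfo contents."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return 100.0 * (total - available) / total

def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the live file on a Linux host."""
    with open(path) as handle:
        return handle.read()

# Made-up sample values, not measurements from the incident.
sample = """MemTotal:        8000000 kB
MemFree:          500000 kB
MemAvailable:    2000000 kB
"""
print(f"{memory_used_percent(sample):.1f}% used")
```

Sampling a figure like this at a fixed interval and shipping it to the existing monitoring channel would leave a record to inspect even if the container is later rebuilt.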
Lessons Learned
Failure of sync jobs to complete is only reported some hours later (in this incident, about six and a half hours after the job started).
Better resource monitoring could assist with RCA in these circumstances.
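One way to shorten the reporting delay would be to treat a long silence in a job's log output as a failure signal. The sketch below illustrates the check only; the 15-minute threshold is a hypothetical value, `is_stalled` is not an existing function in the Sync service, and a real threshold would depend on how often each job normally logs progress.

```python
from datetime import datetime, timedelta

def is_stalled(last_progress: datetime, now: datetime, max_silence: timedelta) -> bool:
    """True if the job has logged no progress for longer than max_silence."""
    return now - last_progress > max_silence

# Hypothetical threshold, chosen for illustration.
MAX_SILENCE = timedelta(minutes=15)

# Last job log entry recorded in this incident.
last_progress = datetime(2021, 5, 19, 1, 2, 33)

early_check = is_stalled(last_progress, datetime(2021, 5, 19, 1, 10, 0), MAX_SILENCE)
later_check = is_stalled(last_progress, datetime(2021, 5, 19, 1, 30, 0), MAX_SILENCE)
print(early_check, later_check)
```

With that threshold, a check at 01:10 would not yet flag the job, while a check at 01:30 would.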