Date	26 May 2021
Authors	Andy Dingley
Status	Resolved
Summary
Impact

Non-technical Description

Trigger

Failure of the Sync service during the job.

Detection

Slack notification in #monitoring
A user noticed through the Person Search page that the job was being rerun

Resolution

Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.

Timeline

19 May 2021: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
19 May 2021: 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
19 May 2021: 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted
19 May 2021: 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted
19 May 2021: 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully
19 May 2021: 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully
19 May 2021: 08:28 BST - PersonElasticSearchSyncJob restarted
19 May 2021: 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

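The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete. Log timestamps appear to be in UTC, one hour behind the BST times in the timeline (the start logged at 00:09:00 corresponds to 01:09 BST).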
- The job started as normal: 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
- The last log entry for the job was recorded at 01:02:33: 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]
- Errors started appearing from 01:12:00
  - 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request
  - 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond
  - Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19
CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal.
Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.
Unfortunately, ancillary logs for the TCS service were not available, since the service had been redeployed (rebuilding the Docker container) before they could be inspected.
Further assessment of the root cause is not possible at this time.
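
The timings quoted from the log excerpts above can be recovered mechanically. The following is a minimal sketch, not the tooling used during the investigation: it assumes the line layout visible in the excerpts (timestamp, level, process id, thread, logger, message), and the names `parse_line` and `summarise_job` are hypothetical.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above:
# "<timestamp> <LEVEL> <pid> --- [<thread>] <logger> : <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+\[(?P<thread>[^\]]*)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a matching log line, or None for anything else."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return entry


def summarise_job(lines, job_name):
    """Find when a job was first and last seen, and when errors began."""
    entries = [e for e in (parse_line(l) for l in lines) if e is not None]
    job_times = [e["ts"] for e in entries
                 if job_name in e["logger"] or job_name in e["msg"]]
    error_times = [e["ts"] for e in entries if e["level"] == "ERROR"]
    return {
        "first_seen": min(job_times, default=None),
        "last_seen": max(job_times, default=None),
        "first_error": min(error_times, default=None),
    }


# The three job-related and error lines quoted in this report.
SAMPLE = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] "
    "u.n.t.s.job.TrustAdminSyncJobTemplate : "
    "Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] "
    "s.j.PersonPlacementEmployingBodyTrustJob : "
    "Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] "
    "u.n.tis.sync.service.DataRequestService : "
    "RESTEASY004655: Unable to invoke request",
]

summary = summarise_job(SAMPLE, "PersonPlacementEmployingBodyTrustJob")
print("job active for:", summary["last_seen"] - summary["first_seen"])
print("silence before first error:", summary["first_error"] - summary["last_seen"])
```

On the quoted lines this gives roughly 53 minutes of job activity, followed by about nine and a half minutes of silence before the first error.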

Action Items

Action Items	Owner
Monitor sync service for similar failures in future. If there is a reoccurrence, further investigation will be warranted.
Consider improving resource monitoring (memory usage in particular).
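
As one possible starting point for the memory monitoring item, the sketch below reads host-level memory figures in the `/proc/meminfo` format found on Linux. It is illustrative only: the function names and the 90% threshold are assumptions, and because other containers share the instance, per-container figures would need a different source.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_fraction(meminfo):
    """Fraction of memory in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total


def check_memory(text, threshold=0.9):
    """Return (used fraction, True if at or above the assumed threshold)."""
    used = memory_used_fraction(parse_meminfo(text))
    return used, used >= threshold


def read_host_meminfo(path="/proc/meminfo"):
    """Read the live figures on a Linux host."""
    with open(path) as handle:
        return handle.read()


# Made-up figures for illustration, not values from the incident.
EXAMPLE = """MemTotal:        8000000 kB
MemFree:          100000 kB
MemAvailable:     400000 kB
"""

used, alarm = check_memory(EXAMPLE)
print(f"memory used: {used:.0%}, alarm: {alarm}")
```

Recording the used fraction at regular intervals would leave a trail that survives a container redeploy, which is the gap noted in the root cause analysis.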

Lessons Learned

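Failure of sync jobs to complete is only reported some hours later: in this incident, the failure notification arrived at 07:38 BST, about six and a half hours after the job started at 01:09 BST.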
Better resource monitoring could assist with RCA in these circumstances.
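
One way to shorten the reporting delay would be a check that flags a job as stalled once it has logged nothing for a set period. The sketch below is hypothetical: `is_stalled` and the 30-minute silence limit are assumptions, not part of the Sync service.

```python
from datetime import datetime, timedelta


def is_stalled(last_activity, completed, now, max_silence=timedelta(minutes=30)):
    """True if the job has not completed and has been silent for too long."""
    if completed:
        return False
    return now - last_activity > max_silence


# Last log entry recorded for the job in this incident (log time).
last_entry = datetime(2021, 5, 19, 1, 2, 33)

# Ten minutes after the last entry: still within the assumed limit.
print(is_stalled(last_entry, False, datetime(2021, 5, 19, 1, 12, 0)))

# Nearly forty minutes after the last entry: flagged as stalled.
print(is_stalled(last_entry, False, datetime(2021, 5, 19, 1, 40, 0)))
```

With a limit of this order, the stall would have been flagged within about half an hour of the last log entry instead of the following morning. The limit would need tuning against the normal gaps between a job's log entries.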
