Resolved

Date	26 May 2021
Authors	Andy Dingley
Status	Done
Summary	NIMDTA reference service could not start
Impact	TIS NIMDTA unavailable

Non-technical Description

Trigger

Failure of the Sync service during the job.

Detection

...

Slack notification in #monitoring

...

User noticed through the Person Search page that the job was being rerun

...

Resolution

Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.

Timeline

19 May 2021: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
19 May 2021: 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
19 May 2021: 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted
19 May 2021: 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted
19 May 2021: 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully
19 May 2021: 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully
19 May 2021: 08:28 BST - PersonElasticSearchSyncJob restarted
19 May 2021: 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

...

The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.

The job started as normal:
- 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
The last log entry for the job was recorded at 01:02:33:
- 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]
Errors started appearing from 01:12:00:
- 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request
- 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond
- Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19
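
The reading of the log excerpts above can be reproduced with a small script. The following is a minimal sketch, assuming the log lines have the single-line layout shown in the excerpts; the function names `parse` and `summarise` are illustrative and not part of the Sync service.

```python
import re
from datetime import datetime

# Log lines copied from the excerpts in this report.
LOG_LINES = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request",
    "2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond",
]

# timestamp, level, pid, '---', [thread], logger, ':', message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<message>.*)$"
)


def parse(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return entry


def summarise(lines, job_name):
    """Return (last entry mentioning the job, first ERROR entry)."""
    entries = [e for e in map(parse, lines) if e is not None]
    job_entries = [
        e for e in entries
        if job_name in e["logger"] or job_name in e["message"]
    ]
    errors = [e for e in entries if e["level"] == "ERROR"]
    last_job = max(job_entries, key=lambda e: e["ts"]) if job_entries else None
    first_error = min(errors, key=lambda e: e["ts"]) if errors else None
    return last_job, first_error


last_job, first_error = summarise(LOG_LINES, "PersonPlacementEmployingBodyTrustJob")
gap = first_error["ts"] - last_job["ts"]
print("last job entry:", last_job["ts"])
print("first error:   ", first_error["ts"])
print("gap:           ", gap)
```

Run against a full log file, the same approach gives the interval between the job going quiet and the first error, which can then be compared with the CPU graph for the instance.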

...

CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal:

...

Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.

...

Unfortunately, ancillary logs for the TCS service were not available because the service had been redeployed (rebuilding the Docker container) before they could be inspected.

...

The reference service for NIMDTA was stuck in a cycle of failed start-up attempts because of an issue with our database migration tool. An upgrade to the tool changed how it handles missing migration steps: instead of ignoring them as it previously did, it now fails start-up.

...

Trigger

Upgraded Flyway version deployed to production.

...

Detection

NIMDTA user reported having an issue loading TIS.

...

Resolution

Delete the schema history entries for the seed data loaded by the Consolidated DR ETL.
Allow the reference service to restart and re-run Flyway validation.
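
The clean-up step can be illustrated as follows. This is a sketch only: it uses an in-memory SQLite database as a stand-in (the production database engine is not stated here), a cut-down version of Flyway's default `flyway_schema_history` table, and hypothetical script names. The helper `delete_history_entries` is illustrative, not the command that was actually run.

```python
import sqlite3


def delete_history_entries(conn, scripts):
    """Delete schema history rows for the given script names; return rows removed."""
    scripts = list(scripts)
    if not scripts:
        return 0
    placeholders = ",".join("?" for _ in scripts)
    cursor = conn.execute(
        f"DELETE FROM flyway_schema_history WHERE script IN ({placeholders})",
        scripts,
    )
    conn.commit()
    return cursor.rowcount


# Stand-in database with a simplified history table (real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flyway_schema_history ("
    "installed_rank INTEGER PRIMARY KEY, version TEXT, "
    "script TEXT NOT NULL, success INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO flyway_schema_history VALUES (?, ?, ?, ?)",
    [
        (1, "1", "V1__create_tables.sql", 1),        # hypothetical service migration
        (2, "2", "V2__add_indexes.sql", 1),          # hypothetical service migration
        (3, "100", "V100__seed_grades.sql", 1),      # hypothetical seed data script
        (4, "101", "V101__seed_sites.sql", 1),       # hypothetical seed data script
    ],
)

# Scripts loaded by the ETL that the service cannot resolve (hypothetical names).
seed_scripts = ["V100__seed_grades.sql", "V101__seed_sites.sql"]
removed = delete_history_entries(conn, seed_scripts)
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT script FROM flyway_schema_history ORDER BY installed_rank"
    )
]
print("removed:", removed)
print("remaining:", remaining)
```

Only the history rows are removed; the seed data itself stays in place. Taking a backup of the history table before deleting from it would be a sensible precaution.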

...

Timeline

26 May 2021: 12:39 BST - Upgraded Flyway deployed to production
26 May 2021: 12:42 BST - Notification in #monitoring-prod, thought to be due to an undeployed ops change.
26 May 2021: 12:45 BST - Ops change merged and applied.
26 May 2021: 14:22 BST - NIMDTA user reported being unable to use the application
26 May 2021: 15:02 BST - Fix deployed.

Root Cause(s)

Flyway migration validation failed after an upgraded version was deployed.
Migrations had been applied to the database that were missing from the reference service.
The Consolidated DR ETL used Flyway to load seed data, but those scripts were not available to the reference service.
The combination of a behaviour change in the latest Flyway versions and the way the migrations were run caused the validation failure. The validation behaviour is correct, but was not foreseen in this case.
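
The failure mode can be modelled in a few lines. The sketch below is a simplified model of the validation check described above, not Flyway's implementation; the `validate` function, the `ignore_missing` switch and the script names are all hypothetical.

```python
class ValidationError(Exception):
    """Raised when applied migrations cannot be resolved locally."""


def find_missing(applied_scripts, available_scripts):
    """Return applied scripts that the service does not have locally."""
    available = set(available_scripts)
    return [script for script in applied_scripts if script not in available]


def validate(applied_scripts, available_scripts, ignore_missing):
    """Model of the check: missing migrations fail unless they are ignored."""
    missing = find_missing(applied_scripts, available_scripts)
    if missing and not ignore_missing:
        raise ValidationError(
            "applied migrations not resolved locally: " + ", ".join(missing)
        )
    return missing


# Recorded in the schema history: service migrations plus ETL seed scripts.
applied = [
    "V1__create_tables.sql",       # hypothetical, shipped with the service
    "V100__seed_grades.sql",       # hypothetical, loaded by the ETL only
]
# Scripts packaged with the reference service.
available = ["V1__create_tables.sql"]

# Behaviour before the upgrade, as described above: missing steps are ignored.
ignored = validate(applied, available, ignore_missing=True)
print("ignored:", ignored)

# Behaviour after the upgrade: the same state fails validation at start-up.
try:
    validate(applied, available, ignore_missing=False)
    failed = False
except ValidationError as error:
    failed = True
    print("start-up blocked:", error)
```

In this model nothing about the database changes between the two calls; only the handling of unresolved entries differs, which is why the problem surfaced on deployment rather than when the seed data was loaded.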

...

Action Items

Action Items	Owner
Monitor sync service for similar failures in future. If there is a reoccurrence, further investigation will be warranted.	
Consider improving resource monitoring (memory usage in particular).	

Lessons Learned

...

Failure of sync jobs to complete is only reported some hours later.
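
One way to shorten this delay would be to flag jobs that have started but not completed within an expected runtime. The sketch below is illustrative only: the `overdue_jobs` function and the one-hour threshold are assumptions, and the start time is taken from the timeline.

```python
from datetime import datetime, timedelta


def overdue_jobs(started, completed, now, max_runtime):
    """Return names of jobs that started, have not completed, and exceed max_runtime."""
    return sorted(
        name
        for name, started_at in started.items()
        if name not in completed and now - started_at > max_runtime
    )


# Start time from the timeline (19 May 2021, 01:09 BST); no completion recorded.
started = {"PersonPlacementEmployingBodyTrustJob": datetime(2021, 5, 19, 1, 9)}
completed = set()

# Hypothetical threshold; the rerun in the timeline took about half an hour.
max_runtime = timedelta(hours=1)

early = overdue_jobs(started, completed, datetime(2021, 5, 19, 1, 30), max_runtime)
late = overdue_jobs(started, completed, datetime(2021, 5, 19, 3, 9), max_runtime)
print("01:30 check:", early)
print("03:09 check:", late)
```

With a check like this run periodically, the stalled job would have been flagged a little over an hour after it started rather than at 07:38.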

...

Document how to investigate Flyway migration issues.	Andy Dingley

...
