Field | Value |
---|---|
Date | |
Authors | |
Status | Resolved (Done) |
Summary | NIMDTA reference service could not start |
Impact | TIS NIMDTA unavailable |
Non-technical Description
Trigger
Failure of the Sync service during the job.
Detection
...
Slack notification in #monitoring
...
A user noticed through the Person Search page that a job was being rerun
...
Resolution
Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.
Timeline
: 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete
: 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed
: 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted
: 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted
: 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully
: 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully
: 08:28 BST - PersonElasticSearchSyncJob restarted
: 08:40 BST - PersonElasticSearchSyncJob completed successfully
Root Cause(s)
...
The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.
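The job started as normal (the log timestamp below is one hour behind the 01:09 BST start in the timeline, which suggests the logs are recorded in UTC):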
2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started
The last log entry for the job was recorded at 01:02:33:
2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]
Errors started appearing from 01:12:00:
2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request
2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond
Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19
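The silence after the job's last log entry is the kind of signal that could be checked automatically. The following is a minimal sketch, not the team's tooling: it assumes the log line layout quoted above, and the function names and the 15-minute threshold are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp at the start of the log lines quoted in this report.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def last_activity(lines, job_name):
    """Return the timestamp of the latest log line mentioning job_name, or None."""
    latest = None
    for line in lines:
        if job_name not in line:
            continue
        match = TIMESTAMP.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if latest is None or stamp > latest:
            latest = stamp
    return latest


def is_stalled(lines, job_name, now, max_silence=timedelta(minutes=15)):
    """True if the job has logged before but has been silent longer than max_silence.

    The 15-minute default is an assumed threshold, not a value from the incident.
    """
    latest = last_activity(lines, job_name)
    return latest is not None and now - latest > max_silence


# Log lines copied from this report.
LOG = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request",
]
```

A job that finished normally is also silent, so a real check would additionally need to exclude jobs that have logged their completion entry; the format of that entry is not shown in this report.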
...
CPU usage for the Sync EC2 instance rose abruptly to over 50% at approximately 01:00, then stayed at 100% from approximately 01:50 to 05:30 (though it should be noted that other containers run on that instance in addition to Sync). This was abnormal:
...
Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.
...
Unfortunately, ancillary logs for the TCS service were not available, since the service had been redeployed (rebuilding the Docker container) before they could be inspected.
...
The reference service for NIMDTA was stuck in a cycle of attempting to start up because of an issue with our database migration tool. An upgrade to the migration tool changed how it handles missing migration steps: start-up now fails, where previously the missing steps were ignored.
...
Trigger
Upgraded Flyway version deployed to production.
...
Detection
NIMDTA user reported having an issue loading TIS.
...
Resolution
Delete the schema history entries for the seed data loaded by the Consolidated DR ETL.
Allow the reference service to restart and re-run Flyway validation.
...
Timeline
: 12:39 BST - Upgraded Flyway deployed to production
: 12:42 BST - Notification in #monitoring-prod, thought to be due to an undeployed ops change.
: 12:45 BST - Ops change merged and applied.
: 14:22 BST - NIMDTA user reported being unable to use the application
: 15:02 BST - Fix deployed.
Root Cause(s)
Flyway migration validation failed after an upgraded version was deployed.
There were migrations applied that were missing from the reference service.
The Consolidated DR ETL used Flyway to load seed data, but those scripts were not available to the reference service.
The combination of the behaviour change in the latest Flyway versions and the way the migrations were run caused the validation failure. The validation behaviour is correct, but was not foreseen in this case.
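The condition behind the validation failure, migrations recorded as applied with no matching script available to the service, can be illustrated with a small sketch. This is not Flyway's own code: it assumes Flyway's default `V<version>__<description>.sql` naming for versioned migrations, compares versions as plain strings, and uses hypothetical function names, versions and file names.

```python
import re

# Flyway's default naming for versioned migrations: V<version>__<description>.sql
VERSIONED_SCRIPT = re.compile(r"^V(?P<version>[0-9][0-9._]*)__.+\.sql$")


def script_versions(filenames):
    """Collect the versions of the versioned migration scripts in filenames."""
    versions = set()
    for name in filenames:
        match = VERSIONED_SCRIPT.match(name)
        if match:
            # Underscores in the version part are treated as dots (1_1 becomes 1.1).
            versions.add(match.group("version").replace("_", "."))
    return versions


def missing_migrations(applied_versions, filenames):
    """Return applied versions (in order) that have no script among filenames."""
    available = script_versions(filenames)
    return [version for version in applied_versions if version not in available]


# Hypothetical data: version 2 was applied by another tool, so the service has no script for it.
applied = ["1", "2", "3"]
scripts = ["V1__create_tables.sql", "V3__add_index.sql", "R__refresh_view.sql"]
print(missing_migrations(applied, scripts))
```

In practice the applied versions would be read from the schema history table rather than hard-coded, and any non-empty result points at entries that will fail validation unless the scripts are supplied or the entries are removed.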
...
Action Items
Action Items | Owner |
---|---|
Monitor the Sync service for similar failures in future. If there is a recurrence, further investigation will be warranted. | |
Consider improving resource monitoring (memory usage in particular). | |
Lessons Learned
...
Failure of a sync job to complete was only reported some hours later: the job started at 01:09 BST and the failure notification arrived at 07:38 BST.
...
Document how to investigate Flyway migration issues.
...