Resolved

Date

Authors

Andy Dingley

Status

Done

Summary

NIMDTA reference service could not start

Impact

TIS NIMDTA unavailable

Non-technical Description

Trigger

  • Failure of the Sync service during the job.

Detection

...

Slack notification in #monitoring

...

User noticed through the Person Search page that the job was being rerun

...

Resolution

  • Rerun of PersonPlacementEmployingBodyTrustJob, PersonPlacementTrainingBodyTrustJob and PersonElasticSearchSyncJob.

Timeline

  • 01:09 BST - PersonPlacementEmployingBodyTrustJob starts on production server, but does not complete

  • 07:38 BST - Notification that PersonPlacementEmployingBodyTrustJob failed

  • 07:43 BST - PersonPlacementEmployingBodyTrustJob restarted

  • 07:55 BST - PersonPlacementTrainingBodyTrustJob restarted

  • 08:14 BST - PersonPlacementEmployingBodyTrustJob completed successfully

  • 08:28 BST - PersonPlacementTrainingBodyTrustJob completed successfully

  • 08:28 BST - PersonElasticSearchSyncJob restarted

  • 08:40 BST - PersonElasticSearchSyncJob completed successfully

Root Cause(s)

...

The PersonPlacementEmployingBodyTrustJob started as scheduled, but failed to complete.

  • The job started as normal: 2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started

  • The last log entry for the job was recorded at 01:02:33: 2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]

  • Errors started appearing from 01:12:00

    • 2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request

    • 2021-05-19 01:18:20.204 INFO 1 --- [onPool-worker-0] o.apache.http.impl.execchain.RetryExec : I/O exception (org.apache.http.NoHttpResponseException) caught when processing request to {}->http://tcs:8093: The target server failed to respond

    • Various errors indicating service failure / timeouts / out-of-memory errors continued until 04:44:19
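The gap between the job's last log entry and the first error can be measured directly from log lines in the format quoted above. The following is a minimal sketch, assuming the Spring Boot-style layout shown in those excerpts; the function names `parse_line` and `silence_before_first_error` are illustrative and not part of any TIS tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above:
# "<timestamp> <LEVEL> <pid> --- [<thread>] <logger> : <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\d+\s+---\s+\[(?P<thread>[^\]]*)\]\s+"
    r"(?P<logger>\S+)\s+:\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return fields


def silence_before_first_error(lines, job_logger_suffix):
    """Seconds between the job's last entry and the first ERROR logged after it."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    job_entries = [e for e in entries if e["logger"].endswith(job_logger_suffix)]
    if not job_entries:
        return None
    last_job = max(job_entries, key=lambda e: e["ts"])
    later_errors = [
        e for e in entries if e["level"] == "ERROR" and e["ts"] > last_job["ts"]
    ]
    if not later_errors:
        return None
    first_error = min(later_errors, key=lambda e: e["ts"])
    return (first_error["ts"] - last_job["ts"]).total_seconds()


# Lines copied from the excerpts in this report.
SAMPLE_LINES = [
    "2021-05-19 00:09:00.008 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Sync [PersonPlacementEmployingBodyTrustJob] started",
    "2021-05-19 01:02:33.517 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [263658] and lastEmployingBodyId: [1922]",
    "2021-05-19 01:12:00.136 ERROR 1 --- [onPool-worker-3] u.n.tis.sync.service.DataRequestService : RESTEASY004655: Unable to invoke request",
]

gap_seconds = silence_before_first_error(
    SAMPLE_LINES, "PersonPlacementEmployingBodyTrustJob"
)
```

For the three sample lines, the gap is roughly nine and a half minutes: the job went quiet well before the first error was logged.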

...

CPU usage for the Sync EC2 instance rose abruptly to >50% at approx. 01:00 and to 100% for the period approx. 01:50 - 05:30 (though it should be noted that other containers are running on that instance in addition to Sync). This was abnormal:

...

Syslogs for the EC2 instance did not provide any specific diagnostic information for this period.

...

Unfortunately ancillary logs for the TCS service were not available, since the service had been redeployed (rebuilding the docker container) before these could be inspected.

...

The reference service for NIMDTA was stuck in a cycle of attempting to start up, due to an issue with our database migration tool. An upgrade to the migration tool changed the behaviour for missing migration steps, causing start-up to fail instead of ignoring the missing steps as it previously did.

...

Trigger

  • Upgraded Flyway version deployed to production.

...

Detection

  • NIMDTA user reported having an issue loading TIS.

...

Resolution

  • Delete the schema history entries for the seed data loaded by the Consolidated DR ETL.

  • Allow the reference service to restart and re-run Flyway validation.
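The first resolution step can be illustrated with a small script. This is a sketch only: it uses an in-memory SQLite database as a stand-in (the production database engine is not stated here), a cut-down version of Flyway's default `flyway_schema_history` table, and hypothetical script names. The `seed_data` naming pattern is an assumption, not the actual name of the Consolidated DR ETL scripts.

```python
import sqlite3


def delete_history_entries(conn, script_pattern):
    """Delete schema history rows whose script name matches a SQL LIKE pattern.

    Returns the number of rows removed.
    """
    cursor = conn.execute(
        "DELETE FROM flyway_schema_history WHERE script LIKE ?",
        (script_pattern,),
    )
    conn.commit()
    return cursor.rowcount


# In-memory stand-in with only the columns needed for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flyway_schema_history ("
    "installed_rank INTEGER PRIMARY KEY, version TEXT, script TEXT)"
)
conn.executemany(
    "INSERT INTO flyway_schema_history VALUES (?, ?, ?)",
    [
        (1, "1", "V1__create_reference_tables.sql"),  # hypothetical
        (2, "2", "V2__add_trust_table.sql"),          # hypothetical
        (3, "100", "V100__seed_data_dr_etl.sql"),     # hypothetical seed script
    ],
)

removed = delete_history_entries(conn, "%seed_data%")
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT script FROM flyway_schema_history ORDER BY installed_rank"
    )
]
```

Selecting the matching rows first, and taking a backup of the history table, would be a sensible precaution before running any delete like this against a real schema.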

...

Timeline

  • 12:39 BST - Upgraded Flyway deployed to production

  • 12:42 BST - Notification in #monitoring-prod, thought to be due to an undeployed ops change.

  • 12:45 BST - Ops change merged and applied.

  • 14:22 BST - NIMDTA user reported being unable to use the application

  • 15:02 BST - Fix deployed.

Root Cause(s)

  • Flyway migration validation failed after an upgraded version was deployed.

  • There were migrations applied that were missing from the reference service.

  • The Consolidated DR ETL used Flyway to load seed data, but those scripts were not available to the reference service.

  • The combination of the behaviour change in the latest Flyway versions and the way the migrations were run caused the validation failure. The validation behaviour is correct, but was not foreseen in this case.
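The failure mode described above can be modelled in a few lines. This is a simplified illustration of the idea, not Flyway's implementation: the function names, the `ignore_missing` flag and the version numbers are all hypothetical. It compares the versions recorded as applied in the schema history with the scripts the service can actually see, and either tolerates or rejects the difference.

```python
def find_missing_migrations(applied_versions, available_versions):
    """Versions recorded as applied for which no local script is available."""
    available = set(available_versions)
    return [version for version in applied_versions if version not in available]


def validate(applied_versions, available_versions, ignore_missing):
    """Return the missing versions, or raise if they are not tolerated."""
    missing = find_missing_migrations(applied_versions, available_versions)
    if missing and not ignore_missing:
        raise RuntimeError(
            "Validation failed; applied migrations not found locally: "
            + ", ".join(missing)
        )
    return missing


# Hypothetical data: version "100" stands for seed data applied by another
# component, whose script the service itself does not ship.
applied = ["1", "2", "100"]
available = ["1", "2"]

# Lenient behaviour: the missing step is reported but start-up continues.
tolerated = validate(applied, available, ignore_missing=True)

# Strict behaviour: the same history now prevents start-up.
try:
    validate(applied, available, ignore_missing=False)
    strict_error = None
except RuntimeError as error:
    strict_error = str(error)
```

In this model, removing the entry for version "100" from the applied history makes strict validation pass, which mirrors the resolution taken.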

...

Action Items

Action Items

Owner

Monitor the sync service for similar failures in future. If there is a recurrence, further investigation will be warranted.

Consider improving resource monitoring (memory usage in particular).

Lessons Learned

...

Failure of sync jobs to complete is only reported some hours later.
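One way to shorten that delay would be a check that flags any job which has started but not completed within an expected run time. The sketch below is an assumption-laden illustration: the function name `overdue_jobs` and the one-hour threshold are hypothetical, and the start time is the 01:09 BST entry from the timeline with the date taken from the log excerpts.

```python
from datetime import datetime, timedelta


def overdue_jobs(started, completed, now, max_runtime):
    """Names of jobs that started but have not completed within max_runtime.

    started:   dict mapping job name to its start time
    completed: set of job names known to have finished
    """
    return sorted(
        name
        for name, started_at in started.items()
        if name not in completed and now - started_at > max_runtime
    )


# Assumed threshold; a real value would be chosen from normal job durations.
MAX_RUNTIME = timedelta(hours=1)

started = {"PersonPlacementEmployingBodyTrustJob": datetime(2021, 5, 19, 1, 9)}
completed = set()

# Shortly after the start, nothing is flagged yet.
early_check = overdue_jobs(
    started, completed, datetime(2021, 5, 19, 1, 30), MAX_RUNTIME
)

# Once the threshold has passed without a completion, the job is flagged.
late_check = overdue_jobs(
    started, completed, datetime(2021, 5, 19, 2, 30), MAX_RUNTIME
)
```

With these assumed values the stalled job would be flagged at the 02:30 check, several hours before the 07:38 notification recorded in the timeline.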

...

Document how to investigate Flyway migration issues.

Andy Dingley

...

Lessons Learned