Date

Authors

Joseph (Pepe) Kelly, Reuben Roberts

Status

Resolved

Summary

The NIMDTA Sync jobs failed to start

Impact 

NIMDTA Person Search page wasn’t showing correct data until approximately 08:55


Non-technical Description

The overnight sync procedure for NI TIS was interrupted and failed to complete.

The “quick access” copy of the data did not sync properly, meaning the Person search page was not operating correctly.

The team created a secondary copy to use, which has resolved the issue. A support ticket has been raised with the hosting company, which has identified potential changes to reduce the likelihood that this will happen again.

...

Trigger

...

...

Detection

  • Detected when expected Slack notifications in the #monitoring-prod channel failed to appear (a sketch of this kind of notification follows)
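If, as the bullet above suggests, the jobs normally post status messages to #monitoring-prod, a minimal sketch of such a notification might look like the following, assuming a standard Slack incoming webhook (the webhook URL and message text here are hypothetical):

    import requests

    # Hypothetical webhook URL; the real value would live in the team's secret store.
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T0000/B0000/XXXXXXXX"

    def notify_monitoring_prod(text: str) -> None:
        """Post a status message to the #monitoring-prod channel."""
        response = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
        response.raise_for_status()  # fail loudly so a missing alert is itself a signal

    # The absence of a message like this is what was noticed.
    notify_monitoring_prod("NIMDTA sync job completed")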

...

Resolution

  • Created a new cluster to use, based on the Terraform description

  • The sync jobs run fine when manually triggered; however, this is a temporary solution (see the health-check sketch below)
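A minimal sketch of the kind of health check worth running before each manual trigger, assuming the cluster is an AWS-hosted Elasticsearch/OpenSearch domain (which the references to a “domain/cluster” and an unelectable state suggest); the endpoint URL is hypothetical:

    import requests

    # Hypothetical endpoint for the replacement cluster built from the Terraform description.
    CLUSTER_ENDPOINT = "https://search-nimdta-example.eu-west-1.es.amazonaws.com"

    def cluster_is_writable() -> bool:
        """Query _cluster/health; red status (e.g. no elected master) means writes may fail."""
        health = requests.get(f"{CLUSTER_ENDPOINT}/_cluster/health", timeout=10).json()
        return health.get("status") in ("green", "yellow")

    if cluster_is_writable():
        print("Cluster looks healthy - safe to trigger the sync job manually")
    else:
        print("Cluster unhealthy - hold off on the sync")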

...

Timeline

  • A node went down and recovered. A node went down a second time, but the cluster did not recover.

  • 01:23 - Slack Alert in the #monitoring-prod channel.

  • 01:29 - Slack Alert in the #monitoring-prod channel. Flagged to look at on Monday AM.

  • OOH support request raised with AWS.

  • 08:08 - We let NI users know of the problem and this was acknowledged at 08:23.

  • Created a replacement resource.

  • 08:58 - Let NI users know that TIS was back up and running.

  • 11:52 - AWS confirmed the domain/cluster was up and running again and available for us to switch back.

  • Alternative cluster configurations are being tested.

  • 17:00 - Scheduled downtime to revert to modified cluster (with a different configuration of nodes).

...

...

Root Cause(s)

...

  • The nightly sync job failed.

  • The cluster wasn’t available

  • The cluster became unhealthy and entered an unelectable state

  • A node failed and a new one wasn’t brought online cleanly

  • One of the nodes went down and caused the cluster to go read-only

  • There weren’t enough nodes to provide guarantees about data consistency

  • The default behaviour is to prioritise data consistency over availability of write transactions (see the quorum sketch below)
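The last two points come down to quorum arithmetic: an Elasticsearch-style cluster needs a strict majority of master-eligible nodes (floor(n/2) + 1) to elect a master, and without one it refuses writes rather than risk inconsistent data. A small illustrative calculation follows; the node counts here are examples, not the actual cluster sizes:

    def quorum(master_eligible: int) -> int:
        """Minimum number of master-eligible nodes needed to elect a master."""
        return master_eligible // 2 + 1

    # A two-node cluster loses quorum as soon as one node fails, so it gives up
    # write availability to protect consistency; three nodes tolerate one failure.
    for total in (2, 3):
        print(f"{total} nodes: quorum={quorum(total)}, tolerates {total - quorum(total)} failure(s)")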

...

Action Items

  • Modify the cluster and (with advance notice via Slack) switch the services back to the modified cluster. Owner: John Simmons (Deactivated)

...

Lessons Learned

  • Terraform for the win! The template for the resource meant we were able to create a new cluster very quickly and restore service.