...

  • A node went down and recovered. A node went down a second time, but the cluster did not recover.

  • 01:23 - Slack alert in the #monitoring-prod channel.

  • 01:29 - Slack alert in the #monitoring-prod channel. Flagged to look at on Monday AM.

  • Out-of-hours (OOH) support case raised with AWS.

  • 08:08 - We let NI users know of the problem; this was acknowledged at 08:23.

  • Created a replacement resource.

  • 08:58 - Let NI users know TIS was up and running.

  • 11:52 - AWS confirmed the domain/cluster was up and running again and available for us to switch back.

  • Alternative cluster configurations are being tested.

  • 17:00 - Scheduled downtime to revert to the modified cluster (with a different configuration of nodes).

  • 17:28 - Notified via Slack that we were using the modified cluster.

...


...

Root Cause(s)

  • The nightly sync job failed.

  • The cluster wasn’t available.

  • The cluster became unhealthy and entered an unelectable state.

  • A node failed and a replacement wasn’t brought online cleanly.

  • One of the nodes went down, causing the cluster to go read-only.

  • There weren’t enough nodes to guarantee data consistency.

  • The default behaviour prioritises data consistency over availability of write transactions.
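The last two root causes come down to quorum: a cluster will only elect a master, and accept writes, while a majority of its master-eligible nodes are reachable. The report does not state the exact node counts, so the sketch below is illustrative; the `quorum` and `can_elect_master` names are hypothetical, not part of any cluster API.

```python
def quorum(total_nodes: int) -> int:
    """Minimum number of master-eligible nodes needed to elect a master:
    a strict majority, i.e. floor(n/2) + 1."""
    return total_nodes // 2 + 1

def can_elect_master(total_nodes: int, healthy_nodes: int) -> bool:
    """A cluster can only elect a master (and so accept writes) while a
    majority of its master-eligible nodes are reachable."""
    return healthy_nodes >= quorum(total_nodes)

# Illustrative three-node cluster: losing one node still leaves a majority.
print(can_elect_master(3, 2))  # True  (2 >= quorum(3) == 2)
# Losing a second node drops below quorum: the cluster becomes
# unelectable and refuses writes rather than risk inconsistency.
print(can_elect_master(3, 1))  # False (1 < 2)
```

This is why the default behaviour trades write availability for consistency: once quorum is lost, accepting writes could split the cluster's view of the data.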

...