06 Feb 2021 A node went down and recovered. A node then went down a second time, but the cluster did not recover.
07 Feb 2021 01:29 - Slack Alert in the #monitoring-prod channel. Flagged for investigation on Monday morning.
08 Feb 2021 01:23 - Slack Alert in the #monitoring-prod channel.
08 Feb 2021 Out-of-hours (OOH) support request raised with AWS.
08 Feb 2021 08:08 - We let NI users know of the problem and this was acknowledged at 08:23.
08 Feb 2021 Created a replacement resource.
08 Feb 2021 08:58 - Let NI users know TIS was up and running.
08 Feb 2021 11:52 - AWS confirmed the domain/cluster was up and running again and available for us to switch back.
10 Feb 2021 Alternative cluster configurations are being tested.
10 Feb 2021 17:00 - Scheduled downtime to revert to modified cluster (with a different configuration of nodes).
10 Feb 2021 17:28 - Notified via Slack that we were using the modified cluster.

...

Root Cause(s)

The nightly sync job failed.
The cluster wasn’t available.
The cluster became unhealthy and entered an unelectable state.
A node failed and a new one wasn’t brought online cleanly.
One of the nodes went down and caused the cluster to go read-only.
There weren’t enough nodes to provide guarantees about data consistency.
The default behaviour is to ensure data consistency over availability of write transactions.
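
The last two points can be illustrated with a simple majority-quorum check. This is a minimal sketch, not the cluster's actual election logic: the function names and node counts are hypothetical, and it assumes the cluster needs a strict majority of its voting nodes to be healthy before it will accept writes.

```python
def quorum_size(voting_nodes: int) -> int:
    """Smallest strict majority of the voting nodes (assumed rule)."""
    return voting_nodes // 2 + 1


def can_accept_writes(voting_nodes: int, healthy_nodes: int) -> bool:
    """True if enough healthy nodes remain to form a majority."""
    return healthy_nodes >= quorum_size(voting_nodes)


if __name__ == "__main__":
    # Hypothetical node counts, chosen only to show the behaviour.
    for total, healthy in [(3, 3), (3, 2), (2, 1), (3, 1)]:
        state = "read-write" if can_accept_writes(total, healthy) else "read-only"
        print(f"{total} voting nodes, {healthy} healthy -> {state}")
```

Under this assumption, a cluster that loses its majority refuses writes rather than risk inconsistent data, which matches the consistency-over-availability default described above.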

...
