...
A node went down and recovered. A node then went down a second time, but the cluster did not recover.
01:29 - Slack Alert in the #monitoring-prod channel. Flagged to look at on Monday AM.
01:23 - Slack Alert in the #monitoring-prod channel.
Out-of-hours (OOH) support request raised with AWS.
08:08 - We let NI users know of the problem and this was acknowledged at 08:23.
Created a replacement resource.
08:58 - Let them know TIS was up and running.
11:52 - AWS confirmed the domain/cluster was up and running again and available for us to switch back.
Alternative cluster configurations are being tested.
17:00 - Scheduled downtime to revert to modified cluster (with a different configuration of nodes).
17:28 - Notified via Slack that we were using the modified cluster.
...
...
Root Cause(s)
The nightly sync job failed
The cluster wasn’t available
The cluster became unhealthy and entered an unelectable state
A node failed and a new one wasn’t brought online cleanly
One of the nodes went down and caused the cluster to go read-only
There weren’t enough nodes to provide guarantees about data consistency
The default behaviour is to ensure data consistency over availability of write transactions
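The last two causes can be illustrated with a generic majority-quorum rule. This is only a sketch: the report does not state how many nodes the cluster had or exactly how the service computes its quorum, so the function names and node counts below are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: a generic majority-quorum rule, not the actual
# logic of the cluster described in this report. All names are hypothetical.

def quorum_size(voting_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return voting_nodes // 2 + 1


def can_accept_writes(voting_nodes: int, healthy_nodes: int) -> bool:
    """A consistency-first cluster only accepts writes while a majority is healthy."""
    return healthy_nodes >= quorum_size(voting_nodes)


def failures_tolerated(voting_nodes: int) -> int:
    """Number of nodes that can be lost before the majority is lost."""
    return voting_nodes - quorum_size(voting_nodes)


if __name__ == "__main__":
    # Assumed node counts, for illustration only.
    for nodes in (2, 3, 5):
        print(nodes, "nodes tolerate", failures_tolerated(nodes), "failure(s)")
```

Under this rule a two-node cluster tolerates no failures at all, while a three-node cluster tolerates one. That is one reason a different configuration of nodes can change how the cluster behaves when a single node is lost.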
...