2021-02-08 Northern Ireland TIS Instance: ElasticSearch down
Date | Feb 8, 2021 |
Authors | @Joseph (Pepe) Kelly @Reuben Roberts |
Status | Resolved |
Summary | The NIMDTA Sync jobs failed to start |
Impact | NIMDTA Person Search page wasn’t showing correct data until approximately 08:55 |
Non-technical Description
The overnight sync procedure for NI TIS was unable to run.
The “quick access” copy of the data did not sync properly, so the Person Search page did not show correct data.
The team created a secondary copy to use, which resolved the issue. A support ticket was raised with the hosting company, which has identified potential changes to reduce the likelihood of this happening again.
Trigger
Detection
Detected when Slack notifications in the #monitoring-prod channel failed to appear
Resolution
Created a new cluster based on the Terraform description
The sync jobs run fine when manually triggered; however, this is a temporary solution
Timeline
Feb 6, 2021 A node went down and recovered. A node went down a second time, but the cluster did not recover.
Feb 7, 2021 01:29 - Slack Alert in the #monitoring-prod channel. Flagged to look at on Monday AM.
Feb 8, 2021 01:23 - Slack Alert in the #monitoring-prod channel.
Feb 8, 2021 OOH (out-of-hours) support ticket raised with AWS.
Feb 8, 2021 08:08 - We let NI users know of the problem and this was acknowledged at 08:23.
Feb 8, 2021 Created a replacement resource.
Feb 8, 2021 08:58 - Let NI users know TIS was up and running again.
Feb 8, 2021 11:52 - AWS confirmed the domain/cluster was up and running again and available for us to switch back.
Feb 10, 2021 Alternative cluster configurations are being tested.
Feb 10, 2021 17:00 - Scheduled downtime to revert to modified cluster (with a different configuration of nodes).
Feb 10, 2021 17:28 - Notified via Slack that we were using the modified cluster.
Root Cause(s)
The nightly sync job failed.
The cluster wasn’t available
The cluster became unhealthy and entered an unelectable state
A node failed and a new one wasn’t brought online cleanly
One of the nodes went down and caused the cluster to go read-only
There weren’t enough nodes to provide guarantees about data consistency
The default behaviour is to ensure data consistency over availability of write transactions
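The last two points in the chain come down to majority arithmetic: a cluster can only elect a master, and so accept writes safely, while a strict majority of its master-eligible nodes can see each other. The sketch below is illustrative only. The function names are hypothetical and the node counts are assumptions, since this report does not record how many nodes the failed cluster had.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(master_eligible_nodes: int, nodes_up: int) -> bool:
    """True if enough master-eligible nodes are reachable to elect a master."""
    return nodes_up >= quorum(master_eligible_nodes)


if __name__ == "__main__":
    # Assumed cluster sizes, chosen only to show the effect of losing one node.
    for total in (2, 3):
        survives = can_elect_master(total, total - 1)
        print(f"{total} master-eligible nodes, one lost: can elect master = {survives}")
```

Under these assumptions, a two-node cluster cannot keep a majority after losing one node, whereas a three-node cluster can. This is the general reason a different configuration of nodes can change how a cluster behaves when a single node fails.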
Action Items
Action Items | Owner |
---|---|
Modify the cluster and (with advance notice via Slack) switch the services back to the modified cluster | @John Simmons (Deactivated) |
Lessons Learned
Terraform for the win! The template for the resource meant we were able to create a new cluster very quickly and restore service.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213