2017-07-31 TIS services unavailable
Date | 2017-07-31 |
Authors | |
Status | Complete |
Summary | An ad-hoc system update restarted the Docker process on the production application server, leaving all services unavailable until the restart completed. |
Impact | The Revalidation service was unavailable nationally for 2-3 minutes. |
Root Cause
An update to the monitoring configuration was run against the whole inventory, so it inadvertently targeted the application servers as well. One of its tasks restarted the Docker process, which manages the TIS applications on each virtual machine.
Trigger
A monitoring update was run with Ansible using the 'all' inventory.
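To illustrate why this target reached the application servers: in Ansible, the built-in `all` group matches every host in the inventory. The sketch below models that behaviour with a plain dictionary. The group names, host names and the `resolve_hosts` helper are hypothetical and are not taken from the TIS inventory or from Ansible itself.

```python
# Hypothetical inventory: group name -> hosts. Names are illustrative only.
INVENTORY = {
    "monitoring": ["mon-01"],
    "application": ["app-01", "app-02"],
}


def resolve_hosts(inventory, target):
    """Return the sorted hosts that a target would match.

    Models only two cases: the special 'all' target (every host in
    every group) and a single named group.
    """
    if target == "all":
        return sorted({host for hosts in inventory.values() for host in hosts})
    return sorted(inventory.get(target, []))


print(resolve_hosts(INVENTORY, "all"))         # includes the application servers
print(resolve_hosts(INVENTORY, "monitoring"))  # monitoring hosts only
```

With the real tooling, `ansible-playbook --list-hosts` prints the hosts a play would target without running any tasks, and `--limit` restricts a run to a subset of the inventory.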
Resolution
The Docker process finished restarting, which in turn restarted the application containers.
Detection
We received alerts in the #monitoring channel from Prometheus.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Platform updates will only be run out of hours or on non-active nodes. | mitigate | | |
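The mitigation above could be enforced by a small pre-flight check before any platform update is run. This is a minimal sketch only: the business-hours window, the host names and the function names are assumptions made for illustration, since this report does not define "out of hours" or list the active nodes.

```python
from datetime import datetime, time

# Assumed window for illustration; the report does not define it.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(18, 0)


def is_out_of_hours(now):
    """True at weekends or outside the assumed business hours."""
    if now.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (BUSINESS_START <= now.time() < BUSINESS_END)


def update_allowed(targets, active_nodes, now):
    """Allow an update out of hours, or when no target is an active node."""
    touches_active = bool(set(targets) & set(active_nodes))
    return is_out_of_hours(now) or not touches_active


# Hypothetical check: a weekday-morning run against an active node is refused.
if not update_allowed(["app-01"], ["app-01"], datetime(2017, 7, 31, 11, 0)):
    print("Refusing to run: target includes an active node during business hours")
```

A check of this kind is only as good as its list of active nodes, so that list would need to come from the same source the deployment tooling uses.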
Timeline
Supporting Information
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213