2017-07-31 TIS services unavailable

Date: 2017-07-31
Authors:
Status: Complete
Summary: An ad-hoc system update restarted the Docker process on the production application server, making all services unavailable until the restart completed.
Impact: The Revalidation service was unavailable nationally for 2-3 minutes.

Root Cause

An update to the monitoring configuration inadvertently targeted the application servers in the inventory. One of its tasks restarted the Docker process, which manages the TIS applications on each virtual machine.

Trigger

Running a monitoring update with Ansible using the 'all' inventory.
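
The sketch below illustrates how this trigger could be caught before a run starts. It assumes that 'all' refers to Ansible's built-in `all` host group, and it is not the tooling used by the team. The function name `build_playbook_command`, the playbook and inventory paths, and the `monitoring` group are all hypothetical. The only Ansible features relied on are the `ansible-playbook` options `-i` and `--limit`.

```python
from typing import List, Optional

# Host patterns that would match every host in the inventory.
# Treating these as unsafe is an assumption made for this sketch.
BROAD_PATTERNS = {"", "all", "*"}


def build_playbook_command(playbook: str, inventory: str, limit: Optional[str]) -> List[str]:
    """Build an ansible-playbook argument list, refusing inventory-wide runs.

    Raises ValueError when no limit is given or the limit matches every host.
    """
    pattern = (limit or "").strip()
    if pattern in BROAD_PATTERNS:
        raise ValueError("refusing to run against every host; pass an explicit limit")
    # -i selects the inventory, --limit restricts the run to the given host pattern.
    return ["ansible-playbook", "-i", inventory, playbook, "--limit", pattern]


if __name__ == "__main__":
    # Hypothetical paths and group name, for illustration only.
    print(build_playbook_command("monitoring.yml", "inventory/production", "monitoring"))
```

Adding `--list-hosts` to the resulting command makes `ansible-playbook` print the matched hosts without executing any tasks, which is another way to confirm the target set before a real run.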

Resolution

Once the Docker process had restarted, the application containers restarted with it and the services became available again.

Detection

We received alerts in the #monitoring channel from Prometheus.

Action Items

Action Item: Platform updates will only be run out of hours or on non-active nodes.
Type: mitigate
Owner:
Issue:

Timeline


Supporting Information

https://monitoring.tis.nhs.uk/grafana/dashboard/db/tis-services?panelId=1&fullscreen&edit&orgId=1&tab=metrics&from=1501496454274&to=1501497850844&var-service=revalidation-health