Date: 2017-07-31
Authors: Graham O'Regan
Status: Complete
Summary: An ad-hoc system update restarted the Docker process on the production application server, which made all services unavailable until the restart completed.
Impact: The revalidation service was unavailable nationally for 2-3 minutes.

Root Cause

An update to the monitoring configuration inadvertently targeted the application servers in the inventory. One of its tasks restarted the Docker process, which manages the TIS applications on each virtual machine.

Trigger

Running a monitoring update with Ansible using the 'all' inventory.
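
One way to guard against this kind of accidental targeting is a small wrapper that refuses to build an `ansible-playbook` command unless a specific host group is named. The sketch below is illustrative only and assumes that 'all' refers to Ansible's built-in group containing every host. The function name `build_playbook_command`, the playbook `monitoring.yml`, the inventory path and the group name `monitoring` are hypothetical and not taken from the TIS repositories; `--inventory` and `--limit` are standard `ansible-playbook` options.

```python
from typing import List

# Patterns that match every host in an Ansible inventory.
BROAD_PATTERNS = {"all", "*"}


def build_playbook_command(playbook: str, inventory: str, limit: str) -> List[str]:
    """Build an ansible-playbook command, refusing to target every host.

    Hypothetical helper: it only assembles the argument list and never
    runs Ansible itself.
    """
    pattern = limit.strip()
    if not pattern or pattern in BROAD_PATTERNS:
        raise ValueError(
            f"Refusing to run {playbook!r} against {limit!r}; "
            "name the target group explicitly."
        )
    return ["ansible-playbook", "--inventory", inventory, "--limit", pattern, playbook]


if __name__ == "__main__":
    # Assumed file and group names, for illustration only.
    print(build_playbook_command("monitoring.yml", "inventory/production", "monitoring"))
```

With a check like this, a monitoring update would have to name the monitoring hosts explicitly, and a request for 'all' would fail before Ansible is invoked.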

Resolution

Once the Docker process finished restarting, it restarted the application containers and the services became available again.

Detection

We received alerts in the #monitoring channel from Prometheus.

Action Items

Action Item | Type | Owner | Issue
Platform updates will only be run out of hours or on non-active nodes. | mitigate | Fayaz Abdul |
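
The "out of hours" rule could be enforced with a small pre-flight check before any platform update. This is a minimal sketch, not the team's actual tooling: the function `is_out_of_hours` is hypothetical, and the working hours (08:00 to 18:00, Monday to Friday) are an assumption, since this report does not define them.

```python
from datetime import datetime, time

# Assumed working hours; the incident report does not specify them.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def is_out_of_hours(now: datetime) -> bool:
    """Return True when `now` falls outside the assumed working hours."""
    if now.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return True
    return not (WORK_START <= now.time() < WORK_END)
```

A deployment script could call `is_out_of_hours(datetime.now())` and stop with an error when it returns `False`, unless the target is a non-active node.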

Timeline


Supporting Information

https://monitoring.tis.nhs.uk/grafana/dashboard/db/tis-services?panelId=1&fullscreen&edit&orgId=1&tab=metrics&from=1501157964618&to=1501160879804&var-service=revalidation-health
