2021-11-18 TIS blue server out of disk space
Date | Nov 18, 2021 |
Authors | @Andy Dingley |
Status | Done |
Summary | |
Impact | TIS running at reduced capacity |
Non-technical Description
TIS is split across two servers, blue and green. Requests are balanced across the two for performance and resilience.
The “blue” server ran out of disk space, causing several of our services to stop functioning.
Trigger
TIS blue server ran out of disk space.
Detection
Slack monitoring alert.
Resolution
Removed unused docker images to reduce disk usage.
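The report does not record the exact command used for the cleanup. As a rough sketch of how it could be scripted and scheduled, the snippet below wraps the Docker CLI's `docker image prune` command. The function names, the 240-hour age cutoff, and the dry-run default are assumptions made for illustration, not the team's actual procedure.

```python
import subprocess


def build_prune_command(max_age_hours=240):
    """Build the docker CLI command that removes unused images older than the cutoff.

    The 240-hour default is an illustrative assumption, not a value from the incident.
    """
    return [
        "docker", "image", "prune",
        "--all",    # include unused tagged images, not only dangling ones
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={max_age_hours}h",
    ]


def prune_old_images(max_age_hours=240, dry_run=True):
    """Run the prune command, or only return it when dry_run is True."""
    command = build_prune_command(max_age_hours)
    if not dry_run:
        # check=True raises CalledProcessError if docker exits with an error
        subprocess.run(command, check=True)
    return command
```

Calling `prune_old_images()` with the default `dry_run=True` only returns the command, so it can be reviewed before anything is deleted.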
Timeline
Nov 18, 2021 00:12 UTC - Notification in #monitoring-prod that TCS and Reference services were down on the blue server
Nov 18, 2021 08:34 UTC - Issue identified as low disk space
Nov 18, 2021 08:47 UTC - Issue resolved by deleting old unused docker images
Nov 18, 2021 10:11 UTC - Preventative action taken on green server to reduce its similarly high disk usage
Root Cause(s)
Blue server ran out of disk space
Large number of outdated docker images
We have no process to clean old docker images
We have inadequate monitoring on disk/storage usage
Action Items
Action Items | Owner | Status |
---|---|---|
Add monitoring for disk/storage space | | |
Review old triggers and get them working again | | |
Lessons Learned
We need better monitoring to pre-emptively warn us before disk space limitations cause downtime.
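As one possible shape for such an early warning, the sketch below checks how full a filesystem is and returns a warning message once usage crosses a threshold. The 80% threshold, the function names, and the idea of running it on a schedule are assumptions for illustration. How the message would be delivered (for example to the Slack monitoring channel) is left out.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path="/", warn_at=80.0, used_percent=None):
    """Return a warning string if usage is at or above `warn_at`, else None.

    `used_percent` can be passed in directly, which makes the check easy to test.
    The 80% default is an assumed threshold, not one taken from the incident.
    """
    if used_percent is None:
        used_percent = disk_usage_percent(path)
    if used_percent >= warn_at:
        return f"WARNING: {path} is {used_percent:.1f}% full (threshold {warn_at:.0f}%)"
    return None
```

A threshold well below 100% leaves time to clean up before services stop, which is the gap this incident exposed.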
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213