2017-07-27 Application server out of disk space
Date | 2017-07-27 |
Authors | Graham O'Regan |
Status | Complete |
Summary | The production application server ran out of disk space so the application could no longer serve content. |
Impact | The Revalidation service was unavailable nationally. |
Root Cause
The Docker containers are stored on the root partition, and the build-up of old images filled the remaining space. We have a Jenkins job that removes old containers and images, but the move from the lin → tis domain names meant we ended up with double the number of images in storage.
Trigger
The Docker images took up most of the disk space and the remaining space was exhausted by log files.
Resolution
We reran the cleanup job on Jenkins and deleted the images that had been downloaded from the lin Docker registry. We also manually deleted images that we knew were no longer needed, such as Piwik.
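The selection step of that cleanup can be sketched as below. This is a minimal illustration, not the Jenkins job itself: the registry host names are hypothetical placeholders (the real lin and tis host names are not recorded here), and the function only picks out image references by registry so that they could then be passed to an image removal command.

```python
def image_registry(reference):
    """Return the registry host of an image reference, or "" if it has none."""
    first, sep, _ = reference.partition("/")
    # Docker treats the first path component as a registry host only when it
    # contains a dot or a colon, or is exactly "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return ""


def images_from_registry(references, registry):
    """Select the image references that were pulled from the given registry."""
    return [ref for ref in references if image_registry(ref) == registry]


# Hypothetical image references; the host names are placeholders.
local_images = [
    "registry.lin.example/revalidation:1.0",
    "registry.tis.example/revalidation:1.0",
    "piwik:latest",
]
stale = images_from_registry(local_images, "registry.lin.example")
```

Matching on the registry host rather than on a substring of the whole reference avoids selecting an image whose repository name merely contains the old domain.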
Detection
We received alerts in the monitoring channel as the web server couldn't serve content. Users in WTV also notified us on the #revalidation channel.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Move the Docker image storage to VM instance storage. | prevent | Fayaz Abdul | |
Prometheus has been extended to alert on all volumes. | prevent | Fayaz Abdul | |
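The idea behind alerting on all volumes, rather than on a single one, can be illustrated with a short sketch. It is not the Prometheus configuration that was deployed: the function name, the 90% threshold, and the mount points in the usage comment are assumptions chosen for illustration, and the check uses Python's standard `shutil.disk_usage` instead of Prometheus metrics.

```python
import shutil


def volumes_over_threshold(mount_points, max_used_fraction=0.9,
                           usage=shutil.disk_usage):
    """Return (mount_point, used_fraction) for each volume above the threshold.

    `usage` is injectable so the check can be exercised without real disks;
    by default it is shutil.disk_usage, which reports total, used and free bytes.
    """
    flagged = []
    for mount_point in mount_points:
        stats = usage(mount_point)
        used_fraction = stats.used / stats.total
        if used_fraction > max_used_fraction:
            flagged.append((mount_point, used_fraction))
    return flagged


# Example call (mount points are assumptions; /var/lib/docker is Docker's
# default data directory on Linux):
# volumes_over_threshold(["/", "/var/lib/docker"])
```

Checking every mount point matters here because the images and the log files shared the root partition, so a check limited to a separate data volume would not have flagged the problem.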
Timeline
13:27 Prometheus sent Slack notification to #monitoring
13:51 WTV notified us in #revalidation
13:56 WTV confirmed that the system was working correctly again
Supporting Information
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213