2017-07-27 Application server out of disk space

Date: 2017-07-27
Authors: Graham O'Regan
Status: Complete
Summary: The production application server ran out of disk space, so the application could no longer serve content.
Impact: The Revalidation service was unavailable nationally.

Root Cause

The Docker containers are stored on the root partition, and the buildup of old images filled the remaining space. We have a Jenkins job that removes old containers and images, but after the move from the lin to the tis domain names we ended up with twice as many images in storage.

Trigger

The Docker images took up most of the disk space, and log files exhausted what remained.

Resolution

We reran the cleanup job on Jenkins and deleted images that had been downloaded from the lin Docker registry. We also manually deleted images that we knew were no longer needed, such as Piwik.
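The selection step in that cleanup can be sketched in Python. This is an illustration only, not the Jenkins job itself: the function names are hypothetical, and the registry host used in the usage note is a placeholder because this report does not record the real hostnames. It assumes the docker CLI is available on the server.

```python
import subprocess


def registry_host(image_ref):
    """Return the registry host of an image reference, or "" if it has none."""
    first, sep, _ = image_ref.partition("/")
    # Docker reads the first path component as a registry host only when it
    # contains a dot or a colon, or is exactly "localhost".
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return ""


def images_from_registry(image_refs, host):
    """Keep only the references that were pulled from the given registry host."""
    return [ref for ref in image_refs if registry_host(ref) == host]


def list_local_images():
    """List local images as repository:tag strings using the docker CLI."""
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}:{{.Tag}}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()
```

For example, images_from_registry(list_local_images(), "registry.lin.example") would return the candidates for removal, which could be reviewed and then passed to docker rmi. Keeping the filter separate from the deletion makes a dry run straightforward.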

Detection

We received alerts in the monitoring channel because the web server couldn't serve content. Users in WTV also notified us in the #revalidation channel.

Action Items

Action Item | Type | Owner | Issue
Move the Docker image storage to VM instance storage. | prevent | Fayaz Abdul (Unlicensed) |
Prometheus has been extended to alert on all volumes | prevent | Fayaz Abdul (Unlicensed) |
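The idea behind checking every volume, rather than a single one, can be sketched in Python. This is not the Prometheus alert that was deployed; the function name, the mount points, and the 90% threshold are assumptions chosen for illustration.

```python
import shutil


def volumes_over_threshold(mount_points, threshold=0.9, usage=shutil.disk_usage):
    """Return (mount_point, used_fraction) for volumes at or above threshold."""
    over = []
    for mount_point in mount_points:
        stats = usage(mount_point)
        if stats.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        fraction = stats.used / stats.total
        if fraction >= threshold:
            over.append((mount_point, fraction))
    return over
```

Calling volumes_over_threshold(["/", "/var/lib/docker"]) would report any of the listed mount points that are at least 90% full. The usage parameter exists only so the check can be exercised with fake numbers.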

Timeline

13:27 Prometheus sent a Slack notification to #monitoring

13:51 WTV notified us in #revalidation

13:56 WTV confirmed that the system was working correctly again.

Supporting Information

https://monitoring.tis.nhs.uk/grafana/dashboard/db/tis-services?panelId=1&fullscreen&edit&orgId=1&tab=metrics&from=1501158302612&to=1501160299004&var-service=revalidation-health