2017-09-14 Application server out of disk space

Date: 2017-09-14
Authors: Graham O'Regan
Status: Complete
Summary: The production application server ran out of disk space, so the application could no longer serve content.
Impact: The Revalidation service was unavailable nationally.

Root Cause

The Docker containers are stored on the root partition, and the build-up of old images filled the remaining space. We have a Jenkins job to remove old containers and images, but the move from the lin to the tis domain names meant that we ended up with twice the number of images in storage.

Trigger

The Docker images took up most of the disk space and the remaining space was exhausted by log files.

Resolution

We reran the cleanup job on Jenkins and deleted the images that had been downloaded from the lin Docker registry. We also manually deleted images that we knew were no longer needed, such as Piwik.
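The manual part of this cleanup amounts to picking out the images whose reference points at the old registry. The following is a minimal Python sketch of that selection step, not the actual Jenkins job. It assumes the image references have already been listed (for example from the output of `docker images`), and `registry.lin.example` and `registry.tis.example` are hypothetical stand-ins for the real registry hosts.

```python
def images_from_registry(image_refs, registry_host):
    """Return the image references hosted on registry_host.

    image_refs: iterable of strings shaped like "host/name:tag".
    registry_host: registry hostname to match (hypothetical here).
    """
    prefix = registry_host.rstrip("/") + "/"
    return [ref for ref in image_refs if ref.startswith(prefix)]


# Hypothetical image references, for illustration only.
refs = [
    "registry.lin.example/revalidation:41",
    "registry.tis.example/revalidation:41",
    "registry.lin.example/piwik:3",
]

# Candidates for deletion: everything pulled from the old registry.
stale = images_from_registry(refs, "registry.lin.example")
print(stale)
```

Matching on the full `host/` prefix, rather than on a substring, avoids selecting an image whose name merely contains the old host name.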

Detection

We received alerts in the monitoring channel as the web server couldn't serve content. Users in WTV also notified us on the #revalidation channel. 

Action Items

Action Item | Type | Owner | Issue
Move the Docker image storage to VM instance storage. | prevent | Fayaz Abdul |
Prometheus has been extended to alert on all volumes. | prevent | Fayaz Abdul |
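To illustrate what alerting on all volumes means in practice, here is a minimal Python sketch that checks a list of mount points against a usage threshold. It is not the Prometheus rule itself. The 90% default is taken from the chat discussion under Supporting Information, and the function names and the `usage_fn` parameter are assumptions introduced for this example.

```python
import shutil


def usage_percent(total, used):
    """Percentage of a volume in use, given byte counts."""
    return 100.0 * used / total if total else 0.0


def volumes_over_threshold(mount_points, threshold=90.0, usage_fn=shutil.disk_usage):
    """Return (mount_point, percent) pairs at or above the threshold.

    usage_fn must return an object with .total and .used attributes,
    as shutil.disk_usage does; it is a parameter so it can be stubbed.
    """
    flagged = []
    for mount in mount_points:
        usage = usage_fn(mount)
        percent = usage_percent(usage.total, usage.used)
        if percent >= threshold:
            flagged.append((mount, round(percent, 1)))
    return flagged


if __name__ == "__main__":
    # Check the root filesystem; add further mount points as needed.
    print(volumes_over_threshold(["/"]))
```

Checking every mount point, rather than only the one assumed to be at risk, is the point of the action item: in this incident the root partition filled up.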

Timeline

03:30 Jenkins jobs failed, which triggered notifications in the Slack #monitoring channel

08:41 We reran the GMC sync ETL https://build.tis.nhs.uk/jenkins/job/gmc-sync-prod/54/console

08:59 We reran the intrepid-reval-etl-all Jenkins job https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/329/console

Supporting Information

srochani [8:17 AM]
@fayaz @graham gmc-sync-prod, elasticsearch-snapshot-prod, intrepid-reval-etl-all-prod and service-registry jobs failed because of no space issue


[8:17]
because of No space left on device\n", "unreachable": true}
to retry, use: --limit @/home/jenkins/data/devops/ansible/tasks/gmcsync.retry


[8:18]
Can you please have a look..


fayaz [8:19 AM]
On it


graham [8:40 AM]
morning, have you rerun them?


fayaz [8:40 AM]
@srochani - please rerun them


graham [8:41 AM]
k, let me kick off gmc first


srochani [8:41 AM]
ok


fayaz [8:41 AM]
adding the cleanup steps as a jenkins job to run nightly till we move to the new servers


graham [8:41 AM]
https://build.tis.nhs.uk/jenkins/job/gmc-sync-prod/54/console


[8:42]
they should be cleaning up the docker containers and old images, do you know why we didn’t get disk pressure warnings?


fayaz [8:43 AM]
I remember silencing them some time back, as it's way ahead of our usual limit of 85%


[8:43]
today will reinstate as priority


[8:43]
should I keep it as 95% instead


[8:44]
we don’t have any space left, on the new servers I am overriding the default 30G disk to 60G


graham [8:44 AM]
yup, set it back to 90%, some of the containers log very heavily


fayaz [8:44 AM]
we keep the last 10 days logs even after backup


[8:44]
same with indices


graham [8:45 AM]
can that be reduced in filebeat?


fayaz [8:46 AM]
curator should handle it


graham [8:46 AM]
but the logs?


fayaz [8:46 AM]
logs we have a logrotate and backup thing, I can reconfigure it with 3 days instead of 10


graham [8:48 AM]
can u set to 5 days, that allows for easter and christmas hols


fayaz [8:48 AM]
ok,