2017-09-14 Application server out of disk space
Date | 2017-09-14 |
Authors | Graham O'Regan (Unlicensed) |
Status | Complete |
Summary | The production application server ran out of disk space so the application could no longer serve content. |
Impact | The Revalidation service was unavailable nationally. |
Root Cause
The Docker containers are stored on the root partition, and the build-up of old images filled the remaining space. We have a Jenkins job to remove old containers and images, but the move from the lin → tis domain names meant we ended up with double the number of images in storage.
Trigger
The Docker images took up most of the disk space and the remaining space was exhausted by log files.
Resolution
We reran the cleanup job on Jenkins and deleted images that had been downloaded from the lin Docker registry. We also manually deleted images that we knew were no longer needed, such as Piwik.
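As a rough illustration of the manual part of this cleanup, the sketch below picks out image references that were pulled from a given registry so that they can be reviewed and passed to `docker rmi`. The host names `registry.lin.example` and `registry.tis.example` and the function name `images_from_registry` are hypothetical; the real registry addresses are not recorded in this report, and this is not the Jenkins job itself.

```python
def images_from_registry(image_refs, registry_host):
    """Return the image references whose registry host matches registry_host.

    image_refs is an iterable of strings such as "host/name:tag", for example
    the lines printed by `docker images --format '{{.Repository}}:{{.Tag}}'`.
    """
    prefix = registry_host.rstrip("/") + "/"
    selected = []
    for ref in image_refs:
        ref = ref.strip()
        # Skip blank lines and dangling images, which are listed as "<none>:<none>".
        if not ref or ref.startswith("<none>"):
            continue
        if ref.startswith(prefix):
            selected.append(ref)
    return selected


if __name__ == "__main__":
    # Hypothetical sample; in practice this would be the `docker images` output.
    sample = [
        "registry.lin.example/revalidation:41",
        "registry.tis.example/revalidation:41",
        "<none>:<none>",
    ]
    for ref in images_from_registry(sample, "registry.lin.example"):
        print("docker rmi", ref)
```

Printing the commands rather than running them keeps the sketch safe to try; the list can be checked by hand before anything is deleted.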
Detection
We received alerts in the monitoring channel as the web server couldn't serve content. Users in WTV also notified us on the #revalidation channel.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Move the Docker image storage to VM instance storage. | prevent | Fayaz Abdul (Unlicensed) | |
Prometheus has been extended to alert on all volumes. | prevent | Fayaz Abdul (Unlicensed) | |
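The idea behind alerting on all volumes can be sketched as a standalone check. This is not the Prometheus rule that was deployed; it is a minimal Python illustration, assuming a 90% threshold (the figure agreed in the Slack discussion under Supporting Information). The function name `volumes_over_threshold`, the `usage_fn` parameter, and the mount points listed are hypothetical.

```python
import shutil


def volumes_over_threshold(mount_points, threshold_percent=90.0,
                           usage_fn=shutil.disk_usage):
    """Return (mount_point, percent_used) for each volume at or above the threshold.

    usage_fn defaults to shutil.disk_usage and can be swapped out for testing.
    """
    over = []
    for mount in mount_points:
        usage = usage_fn(mount)  # named tuple with total, used, free (bytes)
        if usage.total == 0:
            continue  # avoid dividing by zero on pseudo-filesystems
        percent = usage.used * 100 / usage.total
        if percent >= threshold_percent:
            over.append((mount, round(percent, 1)))
    return over


if __name__ == "__main__":
    # Hypothetical mount point; list every volume that should be watched.
    for mount, percent in volumes_over_threshold(["/"]):
        print(f"WARNING: {mount} is {percent}% full")
```

Checking every mounted volume, rather than only the one expected to fill up, is the point of the action item: here the root partition was the one that ran out.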
Timeline
03:30 Jenkins jobs failed, which triggered notifications in the Slack #monitoring channel
08:41 We reran the GMC sync ETL https://build.tis.nhs.uk/jenkins/job/gmc-sync-prod/54/console
08:59 We reran the intrepid-reval-etl-all Jenkins job https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/329/console
Supporting Information
srochani [8:17 AM]
@fayaz @graham gmc-sync-prod, elasticsearch-snapshot-prod, intrepid-reval-etl-all-prod and service-registry job failed because of no space issue
[8:17]
because of No space left on device\n”, “unreachable”: true}
to retry, use: --limit @/home/jenkins/data/devops/ansible/tasks/gmcsync.retry
[8:18]
Can you please have a look..
fayaz [8:19 AM]
On it
graham [8:40 AM]
morning, have you rerun them?
fayaz [8:40 AM]
@srochani - please rerun them
graham [8:41 AM]
k, let me kick off gmc first
srochani [8:41 AM]
ok
fayaz [8:41 AM]
adding the cleanup steps as a jenkins job to run nightly till we move to the new servers
graham [8:41 AM]
https://build.tis.nhs.uk/jenkins/job/gmc-sync-prod/54/console
[8:42]
they should be cleaning up the docker containers and old images, do you know why we didn’t get disc pressure warnings?
fayaz [8:43 AM]
I remember to silent them some time back, as its way ahead of our usual limit 85%
[8:43]
today will reinstate as priority
[8:43]
should I keep it as 95% instead
[8:44]
we don’t have any space left, on the new servers I am overriding the default 30G disk to 60G
graham [8:44 AM]
yup, set it back to 90%, some of the containers log very heavily
fayaz [8:44 AM]
we keep the last 10 days logs even after backup
[8:44]
same with indices
graham [8:45 AM]
can that be reduced in filebeat?
fayaz [8:46 AM]
curator should handle it
graham [8:46 AM]
but the logs?
fayaz [8:46 AM]
logs we have a logrotate and backup thing, I can reconfigure it with 3 days instead of 10
graham [8:48 AM]
can u set to 5 days, that allows for easter and christmas hols
fayaz [8:48 AM]
ok,
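The retention change agreed above (keep 5 days of logs instead of 10) amounts to an age cutoff on log files. The sketch below shows that cutoff in Python; it is not the logrotate configuration in use, and the function name `expired_logs` and the directory `/var/log/app` are hypothetical.

```python
import os
import time


def expired_logs(log_dir, retention_days=5, now=None):
    """Return the sorted paths of files in log_dir older than retention_days.

    Age is measured from each file's modification time. Nothing is deleted.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # 86400 seconds per day
    expired = []
    for entry in os.scandir(log_dir):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            expired.append(entry.path)
    return sorted(expired)


if __name__ == "__main__":
    log_dir = "/var/log/app"  # hypothetical directory
    if os.path.isdir(log_dir):
        for path in expired_logs(log_dir):
            print("would remove", path)
```

With a 5-day window, logs written just before a four-day holiday weekend are still on disk when the team returns, which is the reasoning given in the chat for not dropping to 3 days.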
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213