Date	2017-09-14
Authors	Graham O'Regan
Status	Complete
Summary	The production application server ran out of disk space, so the application could no longer serve content.
Impact	The Revalidation service was unavailable nationally.

Root Cause

The Docker containers are stored on the root partition, and the build-up of old images filled the remaining space. We have a Jenkins job to remove old containers and images, but the move from the lin to the tis domain names meant we ended up with double the number of images in storage.

Trigger

The Docker images took up most of the disk space and the remaining space was exhausted by log files.

Resolution

We reran the cleanup job on Jenkins and deleted images that had been downloaded from the lin Docker registry. We also manually deleted images that we knew were no longer needed, such as Piwik.
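
As a rough illustration of the selection step described above, the sketch below picks out image references whose registry host matches a retired domain. The hostnames and image tags are hypothetical placeholders, since the real registry addresses are not recorded in this report. The code only filters a list of names and does not call Docker or reproduce the actual Jenkins job.

```python
def registry_host(image_ref: str) -> str:
    """Return the registry host of an image reference, or "" if none is given."""
    first, sep, _ = image_ref.partition("/")
    # The first path component is treated as a registry host only if it
    # contains "." or ":" or is "localhost"; otherwise it is a repository name.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return ""


def images_from_registry(image_refs, retired_host):
    """Select the references that were pulled from the retired registry host."""
    return [ref for ref in image_refs if registry_host(ref) == retired_host]


# Hypothetical example data: the same service pulled under both domains.
refs = [
    "registry.lin.example/revalidation:41",
    "registry.tis.example/revalidation:41",
    "registry.tis.example/revalidation:42",
    "piwik:latest",
]
stale = images_from_registry(refs, "registry.lin.example")
print(stale)
```

Matching on the registry host rather than the image name is what catches the duplicates here, because the same image pulled under two domains is stored under two different references.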

Detection

We received alerts in the monitoring channel as the web server couldn't serve content. Users in WTV also notified us on the #revalidation channel.

Action Items

Action Item	Type	Owner	Issue
Move the Docker image storage to VM instance storage.	prevent	Fayaz Abdul
Prometheus has been extended to alert on all volumes.	prevent	Fayaz Abdul
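
To make the "alert on all volumes" idea concrete, here is a minimal sketch of a per-volume usage check. It is not the Prometheus rule itself, which is not included in this report. The 90% default reflects the threshold agreed in the chat log under Supporting Information; the function names and the mount list are assumptions made for illustration.

```python
import shutil


def usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def volumes_over_threshold(paths, threshold_percent=90.0, measure=usage_percent):
    """Return (path, percent) pairs for volumes at or above the threshold.

    `measure` can be swapped out so the check is testable without real disks.
    """
    readings = [(path, measure(path)) for path in paths]
    return [(path, pct) for path, pct in readings if pct >= threshold_percent]


if __name__ == "__main__":
    # Hypothetical mount list; a real check would cover every mounted volume.
    for path, pct in volumes_over_threshold(["/"]):
        print(f"{path} is {pct:.1f}% full")
```

Checking every volume, instead of a single data partition, matters in this incident because the Docker images lived on the root partition.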

Timeline

03:30 Jenkins jobs failed, which sent notifications to Slack #monitoring

08:41 We reran the GMC sync ETL https://build.tis.nhs.uk/jenkins/job/gmc-sync-prod/54/console

08:59 We reran the intrepid-reval-etl-all Jenkins job https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/329/console

Supporting Information

srochani [8:17 AM]
@fayaz @graham gmc-sync-prod, elasticsearch-snapshot-prod, intrepid-reval-etl-all-prod and service-registry jobs failed because of no space issue

[8:17]
because of No space left on device\n", "unreachable": true}
to retry, use: --limit @/home/jenkins/data/devops/ansible/tasks/gmcsync.retry

[8:18]
Can you please have a look..

fayaz [8:19 AM]
On it

graham [8:40 AM]
morning, have you rerun them?

fayaz [8:40 AM]
@srochani - please rerun them

graham [8:41 AM]
k, let me kick off gmc first

srochani [8:41 AM]
ok

fayaz [8:41 AM]
adding the cleanup steps as a jenkins job to run nightly till we move to the new servers

graham [8:41 AM]
https://build.tis.nhs.uk/jenkins/job/gmc-sync-prod/54/console

[8:42]
they should be cleaning up the docker containers and old images, do you know why we didn’t get disc pressure warnings?

fayaz [8:43 AM]
I remember silencing them some time back, as it's way ahead of our usual limit of 85%

[8:43]
today will reinstate as priority

[8:43]
should I keep it as 95% instead

[8:44]
we don’t have any space left, on the new servers I am overriding the default 30G disk to 60G

graham [8:44 AM]
yup, set it back to 90%, some of the containers log very heavily

fayaz [8:44 AM]
we keep the last 10 days logs even after backup

[8:44]
same with indices

graham [8:45 AM]
can that be reduced in filebeat?

fayaz [8:46 AM]
curator should handle it

graham [8:46 AM]
but the logs?

fayaz [8:46 AM]
logs we have a logrotate and backup thing, I can reconfigure it with 3 days instead of 10

graham [8:48 AM]
can u set to 5 days, that allows for easter and christmas hols

fayaz [8:48 AM]
ok,
