2021-11-18 TIS blue server out of disk space

Date

Nov 18, 2021

Authors

@Andy Dingley

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-2349

Impact

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience reasons.

The “blue” server ran out of disk space, causing several of our services to stop functioning.


Trigger

  • TIS blue server ran out of disk space.


Detection

  • Slack monitoring alert.


Resolution

  • Removed unused Docker images to free disk space.
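The report does not record the exact command used for the cleanup. As a minimal sketch only, assuming the standard Docker CLI is available on the server, the step could look like the following. The function names and the 240-hour age cut-off are hypothetical and chosen purely for illustration.

```python
import subprocess
from typing import List


def build_prune_command(older_than_hours: int = 240) -> List[str]:
    """Build a `docker image prune` command for unused images older than the given age.

    The 240-hour default is an illustrative assumption, not a value from the incident.
    """
    if older_than_hours <= 0:
        raise ValueError("older_than_hours must be positive")
    return [
        "docker", "image", "prune",
        "--all",    # remove every image not used by a container, not only dangling ones
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={older_than_hours}h",  # only images created before this age
    ]


def prune_old_images(older_than_hours: int = 240) -> str:
    """Run the prune command and return Docker's textual report of what was removed."""
    result = subprocess.run(
        build_prune_command(older_than_hours),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Docker exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try anywhere.
    print(" ".join(build_prune_command()))
```

Keeping the command construction separate from its execution makes it possible to review or log exactly what would run before anything is deleted.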


Timeline

  • Nov 18, 2021 00:12 UTC - Notification in #monitoring-prod that TCS and Reference services were down on the blue server

  • Nov 18, 2021 08:34 UTC - Issue identified as low disk space

  • Nov 18, 2021 08:47 UTC - Issue resolved by deleting old unused Docker images

  • Nov 18, 2021 10:11 UTC - Preventative cleanup carried out on the green server to reduce its disk usage in the same way

Root Cause(s)

  • Blue server ran out of disk space

  • A large number of outdated Docker images had accumulated on the server

  • We have no process to clean up old Docker images

  • We have inadequate monitoring of disk/storage usage
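To illustrate the monitoring gap listed above, here is a minimal sketch of a disk usage check using only the Python standard library. The 80% and 90% thresholds and the function names are assumptions for illustration; they do not describe the monitoring TIS actually uses.

```python
import shutil


def percent_used(used_bytes: int, total_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def classify(percent: float, warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Map a usage percentage to an alert level (thresholds are illustrative)."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_disk(path: str = "/") -> str:
    """Measure usage of the filesystem containing `path` and classify it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return classify(percent_used(usage.used, usage.total))


if __name__ == "__main__":
    print(check_disk("/"))
```

A check like this would need to run on a schedule and feed an alerting channel to be useful; both of those parts are outside the sketch.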


Action Items

  • Add monitoring for disk/storage space: https://hee-tis.atlassian.net/browse/TIS21-1383. Owner: not recorded. Status: not recorded.

  • Review old triggers and get them working again. Owner: not recorded. Status: not recorded.


Lessons Learned

  • We need better monitoring to warn us before low disk space causes downtime.
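One way to get an early warning, rather than an alert once the disk is already full, is to estimate how long the remaining space will last. The sketch below assumes a simple linear extrapolation from two free-space samples; the function name and the sample figures are hypothetical, and real usage growth may not be linear.

```python
from typing import Optional


def hours_until_full(free_then: float, free_now: float, hours_between: float) -> Optional[float]:
    """Estimate hours until the disk is full from two free-space samples.

    Assumes space keeps being consumed at the average rate seen between the samples.
    Returns None when free space is not shrinking, so no estimate is possible.
    """
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    consumed = free_then - free_now
    if consumed <= 0:
        return None
    rate_per_hour = consumed / hours_between
    return free_now / rate_per_hour


if __name__ == "__main__":
    # Illustrative numbers only: 100 GB free ten hours ago, 80 GB free now.
    print(hours_until_full(100.0, 80.0, 10.0))  # 40.0
```

An alert could then fire when the estimate drops below a chosen lead time, giving the team a chance to clean up before services stop.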