
Date

Authors

Andy Dingley

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-2349

Impact

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience.

The “blue” server ran out of disk space, causing several of our services to stop functioning.


Trigger

  • TIS blue server ran out of disk space.


Detection

  • Slack monitoring alert.


Resolution

  • Removed unused docker images to reduce disk usage.
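
As a rough sketch of this kind of cleanup, the Python below wraps Docker's `docker image prune` command. The exact commands run during the incident are not recorded on this page; the function names and the 240-hour (ten-day) retention window are assumptions made purely for illustration.

```python
import subprocess


def build_prune_command(max_age_hours=240):
    """Build a `docker image prune` command for old, unused images.

    max_age_hours is a hypothetical retention window, not a value
    taken from the incident.
    """
    if max_age_hours <= 0:
        raise ValueError("max_age_hours must be positive")
    return [
        "docker", "image", "prune",
        "--all",    # remove every unused image, not only dangling ones
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={max_age_hours}h",  # only images created before the cutoff
    ]


def prune_old_images(max_age_hours=240, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    command = build_prune_command(max_age_hours)
    print(" ".join(command))
    if not dry_run:
        # check=True raises CalledProcessError if docker exits non-zero
        subprocess.run(command, check=True)
    return command
```

Calling `prune_old_images()` with the default `dry_run=True` only prints the command, which makes it safe to review before running it on a schedule (for example from cron) with `dry_run=False`.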


Timeline

  • 00:12 UTC - Notification in #monitoring-prod that TCS and Reference services were down on the blue server

  • 08:34 UTC - Issue identified as low disk space

  • 08:47 UTC - Issue resolved by deleting old unused docker images

  • 10:11 UTC - Preventative cleanup carried out on the green server to reduce its disk usage

Root Cause(s)

  • Blue server ran out of disk space

  • Large number of outdated docker images

  • We have no process to clean old docker images

  • We have inadequate monitoring on disk/storage usage


Action Items

  • Add monitoring for disk/storage space: https://hee-tis.atlassian.net/browse/TIS21-1383
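
One possible shape for the disk-space monitoring action item is sketched below in Python, using the standard library's `shutil.disk_usage`. This is not the monitoring that was implemented under the ticket; the 80% and 90% thresholds and all function names are assumptions, and sending the result to Slack is left out.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def classify(percent, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to a status; the thresholds are assumed values."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_disk(path="/", warn_at=80.0, critical_at=90.0):
    """Check the filesystem containing `path` and return (status, percent)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = percent_used(usage.total, usage.used)
    return classify(percent, warn_at, critical_at), percent


if __name__ == "__main__":
    status, percent = check_disk("/")
    print(f"{status}: {percent:.1f}% of disk used")
```

Alerting at a "warning" level well below 100% is the point of the sketch: it gives time to clean up before services start failing.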


Lessons Learned

  • We need better monitoring to pre-emptively warn us before disk space limitations cause downtime.
