Date

Authors

Andy Dingley

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-2349

Impact

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience.

The “blue” server ran out of disk space, causing several of our services to stop functioning.


Trigger

  • TIS blue server ran out of disk space.


Detection

  • Slack monitoring alert.


Resolution

  • Removed unused docker images to reduce disk usage.
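The cleanup step above could be scripted along the following lines. This is only a minimal sketch, not the commands actually run during the incident: it assumes the Docker CLI is on the PATH, and the function names and the 720-hour (30-day) default age are hypothetical choices made for illustration.

```python
import subprocess


def build_prune_command(max_age_hours: int = 720) -> list[str]:
    """Build a `docker image prune` command for unused images older than max_age_hours.

    The 720-hour default is an illustrative assumption, not a value from the incident.
    """
    if max_age_hours <= 0:
        raise ValueError("max_age_hours must be positive")
    return [
        "docker", "image", "prune",
        "--all",    # include unused tagged images, not only dangling ones
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={max_age_hours}h",  # only images created before this age
    ]


def prune_old_images(max_age_hours: int = 720) -> str:
    """Run the prune command and return Docker's textual output.

    Raises subprocess.CalledProcessError if the docker command fails.
    """
    result = subprocess.run(
        build_prune_command(max_age_hours),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `prune_old_images()` on a server would remove every image that no container uses and that was created more than 30 days ago. Keeping the command construction separate from its execution makes it possible to review or log the exact command before anything is deleted.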


Timeline

  • 00:12 UTC - Notification in #monitoring-prod that TCS and Reference services were down on the blue server

  • 08:34 UTC - Issue identified as low disk space

  • 08:47 UTC - Issue resolved by deleting old unused docker images

  • 10:11 UTC - Preventative action taken on the green server to reduce disk usage there as well

Root Cause(s)

  • Blue server ran out of disk space

  • Large number of outdated docker images

  • We have no process to clean old docker images

  • ???
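Since the root causes above come down to disk usage growing unnoticed, one possible mitigation is a periodic disk-usage check that flags a server before it fills up. The sketch below uses only the Python standard library; the function names and the 80% threshold are assumptions made for illustration and do not describe existing TIS monitoring.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_cleanup(used_percent: float, threshold_percent: float = 80.0) -> bool:
    """Return True when usage has reached the threshold.

    The 80% default is a hypothetical threshold chosen for this sketch.
    """
    return used_percent >= threshold_percent
```

A scheduled job could call `needs_cleanup(disk_used_percent("/"))` and, when it returns `True`, either raise an alert or trigger an image cleanup before services are affected.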


Action Items

Action Items | Owner | Status


Lessons Learned
