Date | |
Authors | |
Status | Done |
Summary | |
Impact | TIS running at reduced capacity |
Non-technical Description
TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience reasons.
The “blue” server ran out of disk space, causing several of our services to stop functioning.
Trigger
TIS blue server ran out of disk space.
Detection
Slack monitoring alert.
Resolution
Removed unused docker images to reduce disk usage.
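The cleanup above could be scripted so that it can run on a schedule. The following is a minimal sketch, not the command that was actually run during the incident. It assumes the Docker CLI is on the PATH, and the function names and the 240-hour age threshold are hypothetical choices for illustration.

```python
import subprocess


def build_prune_command(older_than_hours: int = 240) -> list[str]:
    """Build a `docker image prune` command line.

    The 240-hour default is an illustrative assumption, not a value
    taken from the incident.
    """
    if older_than_hours <= 0:
        raise ValueError("older_than_hours must be positive")
    return [
        "docker", "image", "prune",
        "--all",    # remove all unused images, not only dangling ones
        "--force",  # skip the interactive confirmation prompt
        # only consider images created more than N hours ago
        "--filter", f"until={older_than_hours}h",
    ]


def prune_old_images(older_than_hours: int = 240) -> str:
    """Run the prune command and return Docker's output (needs Docker installed)."""
    result = subprocess.run(
        build_prune_command(older_than_hours),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if docker exits non-zero
    )
    return result.stdout
```

Calling `prune_old_images()` from a scheduled job on both blue and green would address the "no process to clean old docker images" root cause, provided the age threshold is long enough to keep images needed for rollback.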
Timeline
00:12 UTC - Notification in #monitoring-prod that TCS and Reference services were down on the blue server
08:34 UTC - Issue identified as low disk space
08:47 UTC - Issue resolved by deleting old unused docker images
10:11 UTC - Preventative action taken on green server to reduce similar disk usage
Root Cause(s)
Blue server ran out of disk space
Large number of outdated docker images
We have no process to clean old docker images
We have inadequate monitoring on disk/storage usage
Action Items
Action Items | Owner | Status |
---|---|---|
Add monitoring for disk/storage space | | |
Lessons Learned
We need better monitoring to pre-emptively warn us before disk space limitations cause downtime.
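One way such an early warning could work is a periodic check that compares disk usage against a threshold well below 100%. This is a sketch only: the function names, the monitored path `/`, and the 80% threshold are assumptions, and delivery of the warning (for example to the Slack channel) is left out.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def check_disk(path: str = "/", warn_percent: float = 80.0) -> Optional[str]:
    """Return a warning message if usage is at or above the threshold, else None.

    The 80% default is an illustrative assumption; choose a value that
    leaves enough time to react before the disk fills up.
    """
    percent = disk_usage_percent(path)
    if percent >= warn_percent:
        return (
            f"WARNING: {path} is {percent:.1f}% full "
            f"(threshold {warn_percent:.0f}%)"
        )
    return None


if __name__ == "__main__":
    message = check_disk("/")
    if message:
        print(message)
```

Run from a scheduler on each server, a check like this would raise an alert while there is still free space, rather than after services have already stopped.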