2021-11-18 TIS blue server out of disk space


Date

Nov 18, 2021

Authors

@Andy Dingley

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-2349

Impact

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across both servers for performance and resilience.

The “blue” server ran out of disk space, causing several of our services to stop functioning.


Trigger

  • TIS blue server ran out of disk space.


Detection

  • Slack monitoring alert.


Resolution

  • Removed unused docker images to reduce disk usage.
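
The exact commands used during the incident are not recorded here. As a rough sketch only, a clean-up of this kind could be scripted around the standard `docker image prune` command. The function names and the 240-hour cutoff below are illustrative assumptions, not part of the TIS tooling.

```python
import subprocess
from typing import List


def build_prune_command(older_than_hours: int = 240) -> List[str]:
    """Build a `docker image prune` command for unused images older than the cutoff.

    The 240-hour default is an arbitrary example value, not a TIS setting.
    """
    if older_than_hours <= 0:
        raise ValueError("older_than_hours must be positive")
    return [
        "docker", "image", "prune",
        "--all",    # remove all images not used by a container, not only dangling ones
        "--force",  # skip the interactive confirmation prompt
        "--filter", f"until={older_than_hours}h",  # only images created before the cutoff
    ]


def prune_old_images(older_than_hours: int = 240) -> str:
    """Run the prune command and return Docker's text output (requires Docker on the host)."""
    result = subprocess.run(
        build_prune_command(older_than_hours),
        check=True,           # raise CalledProcessError if docker exits non-zero
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Running something like `prune_old_images()` on a schedule would be one way to stop outdated images accumulating, although the cutoff would need to suit how often images are actually redeployed.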


Timeline

  • Nov 18, 2021 00:12 UTC - Notification in #monitoring-prod that TCS and Reference services were down on the blue server

  • Nov 18, 2021 08:34 UTC - Issue identified as low disk space

  • Nov 18, 2021 08:47 UTC - Issue resolved by deleting old unused docker images

  • Nov 18, 2021 10:11 UTC - Preventative clean-up carried out on the green server to reduce its disk usage in the same way

Root Cause(s)

  • Blue server ran out of disk space

  • Large number of outdated docker images

  • We have no process to clean old docker images

  • We have inadequate monitoring on disk/storage usage
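
To illustrate the kind of check that was missing, here is a minimal sketch of a disk usage probe using only the Python standard library. The thresholds (80% and 90%) and the function names are assumptions made for the example; how such a check would feed into the existing Slack alerting is not covered here.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent_used: float,
                   warn_at: float = 80.0,
                   critical_at: float = 90.0) -> str:
    """Map a usage percentage to an alert level; thresholds are example values."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"/ is {percent:.1f}% full: {classify_usage(percent)}")
```

A warning threshold set well below full leaves time to clean up before services start failing, which is the gap this incident exposed.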


Action Items

  • Add monitoring for disk/storage space: https://hee-tis.atlassian.net/browse/TIS21-1383

  • Review old triggers and get them working again


Lessons Learned

  • We need better monitoring to warn us before low disk space causes downtime.
