Date

Authors

Reuben Roberts

Status

Documenting

Summary

The reference service became unstable on one of the production servers (TIS21-2644).

Impact

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across the two for performance and resilience.

The “blue” server ran out of disk space, causing the reference service to stop functioning.


Trigger

  • TIS blue server ran out of disk space.

Detection

  • Slack monitoring alert.


Resolution

  • Removed unneeded stack dump files from /var/log/apps
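
The manual cleanup above could be scripted along the following lines. This is a minimal sketch, not the procedure actually used during the incident: the function names, the age threshold and the file pattern are all hypothetical, and the real stack dump file names are not recorded on this page. It defaults to a dry run so that nothing is deleted until the file list has been reviewed.

```python
import time
from pathlib import Path


def find_stale_files(directory, pattern, max_age_days, now=None):
    """Return files under `directory` matching `pattern` that were last
    modified more than `max_age_days` ago (hypothetical helper)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for path in Path(directory).rglob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def remove_files(paths, dry_run=True):
    """Return the total size in bytes of `paths`; delete them only when
    `dry_run` is False (hypothetical helper)."""
    freed = 0
    for path in paths:
        freed += path.stat().st_size
        if not dry_run:
            path.unlink()
    return freed
```

For example, `remove_files(find_stale_files("/var/log/apps", "*.dump", 7))` would report how many bytes a cleanup of week-old files would free, where `"*.dump"` is a placeholder pattern rather than the real file name format.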


Timeline

  • 17:34 - Reference service failure alert on Slack

  • 18:34 - Reference service recovery alert on Slack

  • 18:44 - Reference service failure alert on Slack

  • These failures continued periodically until 07:34.


Root Cause(s)

  • Blue server ran out of disk space

  • We have no process to clean unneeded files (old logs, stack dumps, etc.)

  • We have inadequate monitoring on disk/storage usage


Action Items

  • Consider prioritising TIS21-1383 (no owner recorded).


Lessons Learned

  • We need monitoring that warns us before low disk space causes downtime.
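
As an illustration of the kind of check this implies, the sketch below measures how full a filesystem is and maps the result to an alert level. It is an assumption-laden example rather than the team's monitoring setup: the function names and the 80% and 90% thresholds are hypothetical, and delivering the alert (for example to Slack) is deliberately left out.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def classify_usage(percent, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to an alert level.

    The default thresholds are illustrative assumptions, not agreed values.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"
```

Run on a schedule, `classify_usage(disk_usage_percent("/var/log"))` would give a warning level well before the disk is full. The percentage can differ slightly from what `df` reports because of blocks reserved for the root user, so the thresholds should leave some margin.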
