Date |
Authors |
Status | Documenting
Summary | The reference service became unstable on one of the production servers.
Impact | Users should generally have been unaffected, since the reference service continued to run on the other production server.
TIS running at reduced capacity
Non-technical Description
TIS is split across two servers, blue and green. Requests are balanced across the two for performance and resilience.
The “blue” server ran out of disk space, causing the reference service to stop functioning.
...
Trigger
TIS blue server ran out of disk space.
Detection
Slack monitoring alert.
...
Resolution
Removed unneeded stack dump files from /var/log/apps
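The manual cleanup above could be turned into a repeatable step. The following is only a minimal sketch, not the procedure that was actually used: the function names, the `*.dump` file pattern and the 7-day retention window are all assumptions made for illustration, since the report does not say how the stack dump files are named or how long they need to be kept.

```python
import time
from pathlib import Path


def find_stale_files(directory, pattern, max_age_days, now=None):
    """Return files under `directory` matching `pattern` whose last
    modification is older than `max_age_days` days (sorted by path)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        p for p in Path(directory).rglob(pattern)
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def remove_files(paths, dry_run=True):
    """Return the total size in bytes of `paths`.
    Files are only deleted when dry_run is False."""
    freed = 0
    for p in paths:
        freed += p.stat().st_size
        if not dry_run:
            p.unlink()
    return freed
```

A dry run such as `remove_files(find_stale_files("/var/log/apps", "*.dump", 7))` (with the hypothetical pattern replaced by the real one) reports how much space would be reclaimed without deleting anything. Passing `dry_run=False` performs the deletion.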
...
Timeline
17:34 - Reference service failure alert on Slack
18:34 - Reference service recovery alert on Slack
18:44 - Reference service failure alert on Slack
These failures continued periodically until 07:34.
...
Root Cause(s)
Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
...
Action Items
Action Items | Owner
---|---
Consider prioritising |
...
Lessons Learned
We need better monitoring to pre-emptively warn us before disk space limitations cause downtime.
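As an illustration of what such a pre-emptive check might look like, here is a minimal sketch using Python's standard `shutil.disk_usage`. The function names and the 80% / 90% thresholds are assumptions, not values taken from the TIS monitoring setup, and the step that would forward the result to an alerting channel is deliberately left out.

```python
import shutil


def classify_usage(percent, warn_at=80.0, critical_at=90.0):
    """Map a disk usage percentage to 'ok', 'warning' or 'critical'.
    The default thresholds are illustrative assumptions."""
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_disk(path, warn_at=80.0, critical_at=90.0):
    """Return (level, percent_used) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    percent = 100.0 * usage.used / usage.total
    return classify_usage(percent, warn_at, critical_at), percent
```

Run on a schedule against the volume that holds `/var/log/apps`, a check like this would raise a warning while there is still time to clean up, rather than after the service has already failed.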