Date

04 Feb 2022

Authors

Status

Documenting

Summary

The reference service became unstable on one of the production servers.

Jira ticket: TIS21-2644

Impact

Users should generally have been unaffected since the reference service continued to run on the other production server.


TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience.

The “blue” server ran out of disk space, causing the reference service to stop functioning.

...

Trigger

TIS blue server ran out of disk space.

Detection

Slack monitoring alert.

...

Resolution

Removed unneeded stack dump files from /var/log/apps.

...

Timeline

04 Feb 2022 17:34 - Reference service failure alert on Slack
04 Feb 2022 18:34 - Reference service recovery alert on Slack
04 Feb 2022 18:44 - Reference service failure alert on Slack
These failures continued periodically until 07 Feb 2022 07:34.

...

Root Cause(s)

Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
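
The missing cleanup process could look something like the sketch below. This is only an illustration, not an agreed procedure: the function names and the 30-day retention period are assumptions, and the only detail taken from this incident is the /var/log/apps directory. It defaults to a dry run so that nothing is deleted until the list of files has been reviewed.

```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_days, now=None):
    """Return files under `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def remove_stale_files(directory, max_age_days, dry_run=True):
    """Delete stale files and return the list of files that matched.

    With dry_run=True (the default) nothing is deleted.
    """
    stale = find_stale_files(directory, max_age_days)
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale
```

For example, `remove_stale_files("/var/log/apps", 30)` would list candidate files without deleting them. A retention period would need to be agreed before running this with `dry_run=False` on a schedule.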

...

Action Items

Action Items

Owner

Consider prioritising Jira ticket TIS21-1383

...

Lessons Learned

We need better monitoring that warns us about low disk space before it causes downtime.
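
As a rough illustration of the kind of check that could feed such a warning, the sketch below measures disk usage with Python's standard `shutil.disk_usage` and classifies it against two thresholds. The function names and the 80% / 90% thresholds are assumptions made for the example, not values taken from our monitoring setup, and how the result would be sent to Slack is deliberately left out.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def classify_usage(percent, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to 'ok', 'warning' or 'critical'.

    The default thresholds are illustrative assumptions.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"
```

A scheduled job could call `classify_usage(disk_usage_percent("/var/log"))` and raise an alert on anything other than `"ok"`, giving us time to act before a service stops.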
