Date	09 Sep 2020
Authors	Phil James
Status	Resolved
Summary	The database ran out of disk space, which caused a system failure.
Impact	Users were unable to log into TIS for approximately 20 minutes.

Root Cause(s)

Database ran out of space
- Slow logs appeared to take up a disproportionate amount of space

Trigger

Action Items

Owner

Fix monitoring:

Alertmanager should send to #monitoring-prod rather than #monitoring?

UptimeRobot didn’t report the outage until Keycloak was unavailable

Error messages need to be clear

Look at disk management

Decide on bigger disk?

If we want more sophisticated monitoring of services, we need to either check whether the UptimeRobot API can support it or look at another product. API info here: https://uptimerobot.com/api/
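
As a starting point for the disk management item, a small check like the one below could flag a filling disk before the database stops, and show how much of it a log directory is using. This is a minimal sketch, not what is deployed for TIS: the function names, the 80% threshold and any paths passed in are assumptions made for illustration, and it uses only the Python standard library.

```python
import os
import shutil


def disk_usage_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.islink(full_path):
                continue
            try:
                total += os.path.getsize(full_path)
            except OSError:
                # The file may have been rotated or removed mid-walk.
                continue
    return total


def check_disk(path: str, warn_percent: float = 80.0) -> str:
    """Return a one-line status message for the filesystem holding `path`.

    The 80% default is an illustrative threshold, not an agreed value.
    """
    percent = disk_usage_percent(path)
    if percent >= warn_percent:
        return (
            f"WARNING: {path} is {percent:.1f}% full "
            f"(threshold {warn_percent:.0f}%)"
        )
    return f"OK: {path} is {percent:.1f}% full"
```

Run on a schedule against the database data directory, `check_disk` gives a message that could be forwarded to the alerting channel, and `directory_size_bytes` on the slow log directory would show whether logs are again taking a disproportionate share of the disk.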