Date | |
Authors | |
Status | Resolved |
Summary | The database ran out of space, which resulted in system failure. |
Impact | Users were unable to log into TIS for approx. 20 mins (Uptime Robot reported 11 mins) |
https://hee-tis.atlassian.net/browse/TISNEW-51905417
...
Trigger
BAU? It is not clear whether anything in particular caused a jump in usage.
Resolution
...
Users reported the problem at 12:00
Uptime Robot only registered the outage once we took Keycloak down: 11:11:44 in the Uptime Robot logs*
Timeline
12:00 Users reported being unable to access TIS
12:07 fire fire call started
12:10ish Keycloak restarted
12:15ish Sachin spots that the SQL DB is full
12:20ish Data is removed from the database and TIS starts working again
11:23:43 Uptime Robot reports TIS back online*
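As a quick check on the "11 mins" figure, the two Uptime Robot timestamps above can be subtracted directly. This is only a sketch of the arithmetic; it assumes both timestamps fall on the same day and in the same timezone.

```python
from datetime import datetime

# Timestamps copied from the Uptime Robot log entries in this report.
went_down = datetime.strptime("11:11:44", "%H:%M:%S")
back_up = datetime.strptime("11:23:43", "%H:%M:%S")

# Assumes both entries are on the same day and in the same timezone.
outage = back_up - went_down
print(outage)  # 0:11:59
```

So the outage seen by Uptime Robot was just under 12 minutes, which is shorter than the roughly 20 minutes users experienced because it did not register the failure until Keycloak went down.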
Action Items
Action Items | Owner |
---|---|
Fix monitoring: (1) Alertmanager should send to #monitoring-prod rather than #monitoring? (2) Uptime Robot didn’t report the outage until Keycloak was unavailable. (3) Error messages need to be clear. | |
Look at disk management | |
Decide on bigger disk? | |
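As a starting point for the disk management action, a scheduled check along the following lines could warn before a volume fills up. This is a minimal sketch, not what is currently deployed: the thresholds and the path are assumptions, and it only works where the script can see the filesystem holding the database files. A managed database service would need its own storage metric instead.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def classify(percent_used: float, warn_at: float = 80.0, critical_at: float = 90.0) -> str:
    """Map a usage percentage to a severity. Thresholds are illustrative assumptions."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical path: replace with the volume that holds the database files.
    path = "/"
    percent = disk_usage_percent(path)
    print(f"{path}: {percent:.1f}% used ({classify(percent)})")
```

The severity string could then be routed to whichever alert channel is chosen in the monitoring action above.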
Lessons Learned
If we want more sophisticated monitoring of individual services, we either need to find out whether the Uptime Robot API can support this or look at another product. API info here: https://uptimerobot.com/api/
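To help evaluate the Uptime Robot API, the sketch below lists monitors that are currently reported as down. It assumes the v2 `getMonitors` endpoint, the `api_key` and `format` form fields, and the numeric status codes as we understand them from the API documentation linked above; these should be verified there before relying on them. The monitor names in the usage example are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed v2 endpoint; confirm against the Uptime Robot API documentation.
API_URL = "https://api.uptimerobot.com/v2/getMonitors"

# Assumed status codes from the v2 documentation.
STATUS_NAMES = {0: "paused", 1: "not checked yet", 2: "up", 8: "seems down", 9: "down"}
DOWN_STATUSES = {8, 9}


def fetch_monitors(api_key: str) -> dict:
    """POST to getMonitors and return the decoded JSON payload."""
    body = urllib.parse.urlencode({"api_key": api_key, "format": "json"}).encode()
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def monitors_down(payload: dict) -> list:
    """Return (name, status) pairs for monitors in a down state."""
    return [
        (m.get("friendly_name", "unknown"), STATUS_NAMES.get(m.get("status"), "unknown"))
        for m in payload.get("monitors", [])
        if m.get("status") in DOWN_STATUSES
    ]


# Usage with a hand-written payload (hypothetical monitor names), no network call:
sample = {
    "stat": "ok",
    "monitors": [
        {"friendly_name": "TIS", "status": 2},
        {"friendly_name": "Keycloak", "status": 9},
    ],
}
print(monitors_down(sample))  # [('Keycloak', 'down')]
```

Note that this only reports what Uptime Robot already checks. Detecting a full database before login fails would still need a dedicated monitor per service, or a different product.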
* The timezone of the Uptime Robot timestamps is not confirmed; presumed to be UTC.