2020-09-09 Users unable to log in to TIS

Date

Sep 9, 2020

Authors

@Phil James

Status

Resolved

Summary

The database ran out of disk space, which caused a system failure.

Impact

Users were unable to log in to TIS for approximately 20 minutes.

Uptime Robot reported 11 minutes of downtime.

https://hee-tis.atlassian.net/browse/TISNEW-5417

Root Cause(s)

  • Database ran out of space

    • Slow logs seemed to take up a disproportionate amount of space
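To confirm which files are consuming the disk (for example, whether the slow logs really dominate), something like the following could be run on the database host. This is a minimal sketch, not the procedure used during the incident: the function names are hypothetical, and the directory to scan (for example the database data or log directory) is an assumption that depends on how the server is set up.

```python
import os
import shutil


def disk_used_percent(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, top_n=10):
    """Return the `top_n` largest regular files under `root` as (size_bytes, path)."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so a file is not counted twice
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; ignore it
    sizes.sort(reverse=True)  # biggest first
    return sizes[:top_n]


def report(root, top_n=10):
    """Print overall disk usage and the largest files under `root`."""
    print(f"{root}: {disk_used_percent(root):.1f}% of disk used")
    for size, path in largest_files(root, top_n):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

Calling `report("/path/to/db/logs")` (path hypothetical) prints the overall usage followed by the biggest files, which makes it easy to see whether one kind of log accounts for most of the space.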

Trigger

  • Possibly business as usual (BAU): it is not clear that anything in particular caused a jump in usage

Resolution

  • Deleted some log files to clear space

Detection

  • User report at 12:00

  • Uptime Robot, but only once we took Keycloak down: 11:11:44 in the Uptime Robot logs*

Timeline

  • 12:00 Users reported being unable to access TIS

  • 12:07 fire fire call started

  • 12:10ish Keycloak restarted

  • 12:15ish Sachin spots that the SQL database is full

  • 12:20ish space is cleared on the database and TIS starts working again

  • 11:23:43 Uptime Robot reports TIS back online*

Action Items

No owners were recorded for these actions.

  • Fix monitoring:

    • Alertmanager should send to #monitoring-prod rather than #monitoring?

    • Uptime Robot did not report the outage until Keycloak was unavailable

    • Error messages need to be clear

  • Look at disk management

  • Decide on a bigger disk?

Lessons Learned

  • If we want more sophisticated monitoring of services, we have to either check whether the Uptime Robot API can support this or look at another product. API info: https://uptimerobot.com/api/

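  • * Timezone of the Uptime Robot timestamps is not confirmed; presumed UTC, which would put them one hour behind the local (BST) times used elsewhere in this report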
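The timezone question can be checked by converting the Uptime Robot timestamps. The sketch below assumes they are UTC and that the rest of the timeline is in UK local time (`Europe/London`). Both are assumptions, and `utc_to_local` is a hypothetical helper. It needs Python 3.9+ for `zoneinfo`, plus the `tzdata` package on systems without a timezone database.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def utc_to_local(date_str, time_str, tz_name="Europe/London"):
    """Interpret a date and time as UTC and convert it to the named timezone."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc).astimezone(ZoneInfo(tz_name))


# Timestamps taken from the Uptime Robot logs quoted in this report.
down = utc_to_local("2020-09-09", "11:11:44")
up = utc_to_local("2020-09-09", "11:23:43")

print("outage detected:", down.strftime("%H:%M:%S %Z"))  # 12:11:44 BST
print("back online:    ", up.strftime("%H:%M:%S %Z"))    # 12:23:43 BST
print("downtime:       ", up - down)                     # 0:11:59
```

Under these assumptions the converted times fall just after the Keycloak restart at around 12:10 and the recovery at around 12:20, and the difference is close to the 11 minutes Uptime Robot reported, which is consistent with the timestamps being UTC.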