Date | 2nd October 2017 |
Authors | Chris Mills |
Status | In progress / Completed |
Summary | Due to issues with Keycloak on Thursday, multiple core containers were down on production monitoring, so we weren't able to see production application issues |
Impact | Because we could not see that anything was down, the outage went unnoticed and no users were able to access the frontend system. |
...
Action Item | Type | Owner | Issue |
---|---|---|---|
Add monitoring to the monitoring server so that it's active at all times | prevent | Fayaz Abdul | |
Improve monitoring output for clearer issue tracking | mitigate | Chris Mills | |
Move monitoring onto alternate VNET | | | |
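The first action item (monitoring the monitoring server) could be approached with a staleness check. The following is only a rough sketch, not what we have deployed: it assumes the monitoring server can emit a regular heartbeat timestamp, and the function name `monitoring_is_stale` and the 10-minute threshold are hypothetical. The example timestamps are taken from this incident (last Prometheus alert, and the time the outage was noticed).

```python
from datetime import datetime, timedelta


def monitoring_is_stale(last_heartbeat, now, max_silence=timedelta(minutes=10)):
    """Return True if monitoring has been silent for longer than max_silence.

    last_heartbeat: when the monitoring server was last heard from (assumed input).
    max_silence: hypothetical threshold; tune to the real heartbeat interval.
    """
    return now - last_heartbeat > max_silence


# Timestamps from this incident, used purely as an illustration.
last_alert = datetime(2017, 9, 27, 8, 36)   # last alert from Prometheus
noticed = datetime(2017, 10, 2, 8, 56)      # production UI reported inaccessible

silence = noticed - last_alert
print("Monitoring silent for:", silence)
print("Stale:", monitoring_is_stale(last_alert, noticed))
```

A check like this would need to run somewhere other than the monitoring server itself, otherwise it goes down together with the thing it is watching.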
Timeline
Sept 28th
Keycloak issues; see 2017-09-28 Keycloak database backup failed
Oct 2nd
8:56AM Reuben Noot noticed that the production UI wasn't accessible
9:15AM We broke out into a separate channel to debug the issue. Admins-UI, Keycloak and TCS were restarted.
10:02AM Graham noticed monitoring didn't fire and stated that the Docker processes had restarted 44 hours ago.
10:10AM Graham checked whether the security updates were the cause of the problem and noticed that they also weren't running.
10:17AM Chris brought monitoring server back up
10:43AM Fayaz put the root cause at the earlier problems with Keycloak, as the last alert from Prometheus was Sept 27th at 08:36
11:00AM We restarted the containers and the majority of the issues were resolved
11:15AM We added tickets to create monitoring checks, add security updates back in, make monitoring notifications clearer and move the monitoring over onto the tools VNET.
Supporting Information
@alex.dobre @fayaz @chrism not sure why yet, but a few of the containers were down on production this morning
...