Date: 2nd October 2017
Authors: Chris Mills
Status: Completed
Summary: Due to issues with Keycloak on Thursday, multiple core containers were down on production monitoring, so we weren't able to see production application issues.
Impact: Because we were not able to see that anything was down, no users were able to access the frontend system.

...

| Action Item | Type | Owner | Issue |
|---|---|---|---|
| Add monitoring to the monitoring server so that it's active at all times | prevent | Fayaz Abdul | TISDEV-2684 |
| Improve monitoring output for clearer issue tracking | mitigate | Chris Mills | |
| Move monitoring onto alternate VNET | | | |


Timeline

Sept 28th

Keycloak issues; see "2017-09-28 Keycloak database backup failed"

Oct 2nd

8:56AM Reuben Noot noticed that the production UI wasn't accessible

9:15AM We broke out into a separate channel to debug the issue. Admins-UI, Keycloak and TCS were restarted.

10:02AM Graham noticed that monitoring didn't fire and stated that the Docker processes had restarted 44 hours ago.

10:10AM Noticed that the security updates also weren't running, after he checked whether they were the cause of the problem.

10:17AM Chris brought the monitoring server back up

10:43AM Fayaz put the root cause at the earlier problems with Keycloak, as the last alert from Prometheus was Sept 27th at 08:36

11:00AM We restarted the containers and the majority of the issues were resolved

11:15AM We added tickets to create monitoring checks, add security updates back in, make monitoring notifications clearer and move the monitoring over onto the tools VNET.
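
The "monitoring for the monitoring server" action item could take the form of a staleness check run from a separate host. The following is a minimal sketch, not the check that was actually implemented: the function name `monitoring_is_stale` and the 24-hour threshold are assumptions, and the last Prometheus alert time from the timeline stands in for a real heartbeat signal.

```python
from datetime import datetime, timedelta


def monitoring_is_stale(last_heartbeat: datetime, now: datetime,
                        max_silence: timedelta) -> bool:
    """Return True when the last heartbeat is older than max_silence."""
    return now - last_heartbeat > max_silence


# Illustrative values taken from the timeline: the last Prometheus alert
# (Sept 27th 08:36) and the time the gap was identified (Oct 2nd 10:43).
# The 24-hour threshold is an assumption, not a value from the incident.
last_alert = datetime(2017, 9, 27, 8, 36)
noticed_at = datetime(2017, 10, 2, 10, 43)

if monitoring_is_stale(last_alert, noticed_at, max_silence=timedelta(hours=24)):
    print("monitoring has been silent for", noticed_at - last_alert)
```

A check like this is only useful if it runs somewhere that does not share a failure mode with the monitoring server, which is consistent with the ticket to move monitoring onto the tools VNET.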


Supporting Information

@alex.dobre @fayaz @chrism not sure why yet, but a few of the containers were down on production this morning

...