
Date: 2nd October 2017
Authors: Chris Mills
Status: In progress
Summary: Due to issues with Keycloak on Thursday, multiple core containers on production monitoring were down, so we weren't able to see production application issues.
Impact: Because monitoring was down we could not see that anything had failed. No users were able to access the frontend system.

Root Cause

The root cause is split into sections covering the different parts of the system this incident affected.

Monitoring:

The monitoring containers were not running as a result of redeploying the stack in relation to the Keycloak issue on the 28th (see 2017-09-28 Keycloak database backup failed for more information).

Prod Applications:

A number of prod applications were not restored correctly after the issue with Keycloak on the 28th.
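
As Graham notes in the Supporting Information below, when the Docker process restarts, containers without a restart policy don't come back up automatically. As a rough sketch (the container name here is illustrative, not necessarily one of the affected services), the policy can be checked and corrected with the standard Docker CLI:

```
# Show the restart policy currently configured on a container (name is illustrative)
docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' tcs_app_1

# Have Docker bring the container back up after a daemon restart or host reboot
docker update --restart unless-stopped tcs_app_1
```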

Trigger

We believe the ultimate trigger was the incident on the 28th, caused by Keycloak database issues, after which a number of core services were not brought back up correctly.

Resolution

Restarting each affected service resolved the problem.
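
In practice this was done per service; Graham's app-restart.yml Ansible task (linked in the Supporting Information) was put together to bring the stack up cleanly. A minimal manual equivalent, using the monitoring container names that appear in the docker output later in the log, would look something like:

```
# List containers that exited and were not restarted automatically
docker ps --filter "status=exited" --format "table {{.Names}}\t{{.Status}}"

# Start the stopped services again (names taken from the docker output in the log below)
docker start monserver_prometheus_1 monserver_grafana_1 monserver_pushgateway_1
```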

Detection

We were notified by Reuben Noot, who reported on Slack that he was unable to access production:

Reuben Noot [8:56 AM]
production doesn't look to healthy - getting internal server error just trying to get to https://apps.tis.nhs.uk/ui/

Chris noticed that the monitoring service wasn't running and restarted it.

Chris Mills [10:10 AM]
Promethus isn't running

Fayaz noticed that a number of further containers weren't running after we had restored most of the other services.

Fayaz Abdul [10:46 AM]
only ui seems to be down


[10:46]
and @graham restarted tcs and admins-ui

Fayaz Abdul [10:58 AM]
kibana route is also closed as some of the logging container are also didn’t started after 27-Sep

Action Items

Action Item | Type | Owner | Issue
Add monitoring to the monitoring server so that it's active at all times | prevent | Fayaz Abdul | TISDEV-2684
Improve monitoring output for clearer issue tracking | mitigate | Chris Mills |
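
For the "monitor the monitoring" action (TISDEV-2684), Graham suggests in the Supporting Information that this could be as simple as a scheduled job that tries to connect and lets Alertmanager pick up the rest. A hedged sketch, using the Alertmanager status endpoint Chris mentions later in the log:

```
#!/bin/bash
# Basic liveness check for the monitoring stack (sketch only).
# A Jenkins job or cron entry could run this and notify on a non-zero exit.
if ! curl -fsS --max-time 10 https://monitoring.tis.nhs.uk/alertmanager/api/v1/status > /dev/null; then
    echo "Alertmanager is not responding - monitoring may be down" >&2
    exit 1
fi
echo "Alertmanager is up"
```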




Timeline

Sept 28th

Keycloak issues; see 2017-09-28 Keycloak database backup failed.

Oct 2nd

8:56AM Reuben Noot noticed that the production ui wasn't accessible

9:15AM We broke out into a separate channel to debug the issue. Admins-UI, Keycloak and TCS were restarted.

10:02AM Graham noticed monitoring didn't fire and stated that the Docker process had restarted 44 hours ago.

10:10AM Graham noticed that the security updates job also wasn't running, having checked whether it was the cause of the problem.

10:17AM Chris brought monitoring server back up

10:43AM Fayaz put the root cause down to the earlier Keycloak problems, as the last alert from Prometheus was on Sept 27th at 08:36.

11:00AM We restarted the containers and the majority of the issues were resolved

11:15AM We added tickets to create monitoring checks, re-enable security updates, make monitoring notifications clearer and move monitoring over onto the tools VNET.


Supporting Information

@alex.dobre @fayaz @chrism not sure why yet, but a few of the containers were down on production this morning

16
TCS, admins-ui and keycloak have had to be restarted, I’, checking to see if there is anything else that should be running

17
@fayaz and @U6ZUL8LFR can you investigate why monitoring didn’t fire it’d be good if the two of you went through that together

Graham O'Regan 02 AM
@U6ZUL8LFR i was just chatting to @fayaz about the problem this morning

02
it looks like the docker process restarted 44hours ago but when it restarts some containers don’t come back up automatically

Chris Mills 03 AM
hmm ok cool

Graham O'Regan 03 AM https://github.com/Health-Education-England/TIS-DEVOPS/blob/master/ansible/tasks/app-restart.yml

03
i put thtat together last week when i was having the DB problems because i need to bring it up cleanly

04
i haven’t investigated why those four don’t, might be worth having a dig into it at some point

Chris Mills 10 AM
Promethus isn't running

Graham O'Regan 10 AM
there is a task that we should be running nightly, i thought there was a jenkins job running this https://github.com/Health-Education-England/TIS-DEVOPS/blob/master/ansible/tasks/ubuntu-security-updates.yml

Fayaz Abdul 10 AM
we do the docker cleanup task every friday night 9PM, but the times aren’t matching

Graham O'Regan 10 AM
i suspected that a security update over the weekend had restarted the process but that doens’t seem to be the case

Fayaz Abdul 11 AM
but saying this stage seems to be fine, if the docker cleanup was the problematic one then we should notice the same on stage too

Chris Mills
14 AM
did we turn monserver off on purpose

14
or just failed?

Graham O'Regan 16 AM
true, not sure what stopped it, the uptime on the box is 13 days

Chris Mills 17 AM
yeah strange, monserver only had statsd running (edited)

Fayaz Abdul 18 AM
docker system prune -af runs every friday

Graham O'Regan 18 AM
can you guys put together an incident log for this?

Chris Mills
19 AM
sure

Graham O'Regan 19 AM
odd, tho, the process restarted around lunch on saturday

20
can you see when the mon server went down?

Chris Mills 21 AM
the old containers were swept but I'm looking elsewhere

Fayaz Abdul 41 AM
it seems monitoring is down from the day we had keycloak prod issue, may be as part of restarts at night something messed up

Fayaz Abdul 43 AM
even from monitoring channel alerts Sep27th 08:36 was the last alert from prometheus, so its down and the rootcause is restart of docker and redeploying the stack

43
So when the prod went down over the weekend monitoring didn’t alerted as its not running

43
@chrism @U23HY7421 :point_up:

44
now investigation on what went wrong on prod, without monitoring over the period its a bit tricky, going through the logs

45
@graham: what you did this morning to bring prod backup

Chris Mills 45 AM
@fayaz what logs are you looking at so we're not looking at the same.

46
most of them were cleared

Fayaz Abdul 46 AM
now I need to determine why the ui on prod is down

46
we got the answer for monitoring

46
when I logged in this morning revalidation is working fine

Chris Mills
46 AM
@here https://hee-tis.atlassian.net/wiki/spaces/TISDEV/pages/109772806/2017-10-02+Monitoring+Apps+Production+Container+failure incident log we're updating as we go. If you get information let me know and I'll update.

Fayaz Abdul 46 AM
only ui seems to be down

46
and @graham restarted tcs and admins-ui

Chris Mills 47 AM
yeah cool

Fayaz Abdul 58 AM
kibana route is also closed as some of the logging container are also didn’t started after 27-Sep

Graham O'Regan 58 AM https://hee-tis.atlassian.net/browse/TISDEV-2684

59
created a ticket to pick up a basic monitor, this could be as simple as an ansible task in jenkins again that tries to connect, if it works then it can assume that alert manager will pick up the rest

Chris Mills 01 AM
I've added it to the report

Chris Mills
24 AM
```6b88d03e3b5d grafana/grafana "/run.sh" About an hour ago Exited (0) 15 minutes ago monserver_grafana_1
1eb8fe54a663 prom/prometheus "/bin/prometheus -..." About an hour ago Exited (0) 15 minutes ago monserver_prometheus_1
5b5f2bb2f17b prom/pushgateway "/bin/pushgateway ..." About an hour ago Exited (0) 15 minutes ago monserver_pushgateway_1```
(edited)

24
they exited again, just looking why

Graham O'Regan 25 AM
@chrism ^ @fayaz’s restart?

Chris Mills 25 AM
oh yeah :facepalm:

26
haven't come up back up yet though has site run yet?

Graham O'Regan 30 AM
think so, just the monitoring servers?

Chris Mills
30 AM
yup

30
statsd came back

Chris Mills
43 AM
For the check btw we can get https://monitoring.tis.nhs.uk/alertmanager/api/v1/status if needed.

Graham O'Regan 50 AM
i’ve been coming back to the idea of separating out all the tasks from the build, we discussed doing it on a seprate jenkins, wut?

51
@fayaz @chrism^ ?

52
we need to get that security updates job running again too

Chris Mills
53 AM
For the time being we could seperate them all out into a view

53
Oh they already are :facepalm:

Chris Mills
59 AM
I've created the job @graham we happy with 2:30am?

Graham O'Regan 02 PM
earlier might be better, try it at 11:30, if they fail then all the etls will fail too. the gmc runs earlier than the rest so we could get into a state when the gmc runs but the rest doesn’t

Chris Mills
04 PM
ok cool thank

Fayaz Abdul 25 PM
@graham @chrism, seems to have some issue with docker_login ansible module, looking into it, as its not able to authorize using the module, creds when used independently are working fine

