Date	09 Apr 2018
Authors	Chris Mills (Unlicensed)
Status	In progress/Complete
Summary	Failures across infrastructure causing issues with multiple services.
Impact	Development and other ETL runs may not have run.

Root Cause

...

To be investigated.

Resolution

To be decided

05:29 - Restarted the VM. Looks like it's back.

05:32 - Restarted Intrepid ETLs that failed.

05:34 - Started recovery of 10.150.0.137/8's docker.

05:41 - ETLs started earlier failed due to the docker issue.

05:47 - Docker recovery worked on the ETL box. This now needs scripting for the other machines. Intrepid DR ETL is being run at the same time, then on to Consolidated.

05:50 - Looks like site-dev interacts with the ETL box, which could have bumped the version of docker via apt security updates and killed it. Apt security updates are good, but those machines weren't configured correctly in the first place to handle normal operating procedure.

failed: [10.140.0.136]

06:04 - Intrepid DR ETL looks like it's been running in dev for a while....

To be discovered.

Detection

Failures on a number of services over the weekend:

...

sshd not responding on 10.140.0.136
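
A check like the sshd one above could be automated rather than discovered after the weekend. The following is a minimal sketch, assuming that a successful TCP connection to port 22 is an acceptable proxy for sshd being up (a hung sshd may still accept connections). The function name `port_is_open` and the timeout are illustrative and not part of any existing tooling.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use against the host from this incident:
#   port_is_open("10.140.0.136")
```

Run on a schedule against each host, a check like this would flag an unreachable box before the ETLs that depend on it start failing.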

Action Items

Action Item	Type	Owner	Issue
Use docker from Docker's own apt repository rather than the Ubuntu-packaged version (docker-ce rather than docker.io). Ansible job.	mitigate/prevent
Write correct, unified setup scripts to replace the awful separate ones
Build a better understanding of the ETLs
Retire https://github.com/Health-Education-England/TIS-DEVOPS/blob/master/ansible/roles/docker-host/tasks/main.yml
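
To find which machines still run the Ubuntu-packaged docker, one option is to query dpkg on each host. The sketch below assumes a Debian/Ubuntu host with `dpkg-query` available. The helper names `parse_dpkg_listing`, `installed_docker_packages` and `uses_ubuntu_docker` are hypothetical, not taken from the existing Ansible roles.

```python
import subprocess
from typing import Dict


def parse_dpkg_listing(output: str) -> Dict[str, str]:
    """Parse '<package> <version>' lines; skip packages with no version."""
    packages = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # a missing version means the package is not installed
            packages[parts[0]] = parts[1]
    return packages


def installed_docker_packages() -> Dict[str, str]:
    """Ask dpkg which of docker.io / docker-ce are installed on this host."""
    cmd = ["dpkg-query", "-W", "-f", "${Package} ${Version}\n",
           "docker.io", "docker-ce"]
    # dpkg-query exits non-zero when a package is unknown, so do not raise on it
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_dpkg_listing(result.stdout)


def uses_ubuntu_docker(packages: Dict[str, str]) -> bool:
    """True if the Ubuntu-packaged docker (docker.io) is installed."""
    return "docker.io" in packages
```

Calling `uses_ubuntu_docker(installed_docker_packages())` on a host gives a quick yes/no for whether it still needs moving to docker-ce.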

Timeline

Supporting Information

e.g. monitoring dashboards

https://build.tis.nhs.uk/jenkins/blue/organizations/jenkins/devops/detail/devops/2258/pipeline shouldn't use rsync to sync with both machines; as it stands, if one goes down the other fails.
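
If rsync stays in the pipeline for now, one way to stop a single dead machine from failing the other is to sync each host independently and collect the results. This is a sketch only: the source path, destination path and host list are placeholders, and `sync_all` is a hypothetical helper, not what the Jenkins job currently does.

```python
import subprocess
from typing import Callable, Dict, List


def rsync_command(source: str, host: str, dest: str) -> List[str]:
    """Build the rsync invocation for a single target host."""
    return ["rsync", "-az", source, f"{host}:{dest}"]


def run_command(cmd: List[str]) -> int:
    """Run a command and return its exit code (non-zero means failure)."""
    return subprocess.run(cmd).returncode


def sync_all(source: str, hosts: List[str], dest: str,
             runner: Callable[[List[str]], int] = run_command) -> Dict[str, int]:
    """Sync to every host in turn; a failure on one does not stop the rest."""
    results = {}
    for host in hosts:
        try:
            results[host] = runner(rsync_command(source, host, dest))
        except OSError:
            # e.g. rsync binary missing; record as failed and carry on
            results[host] = -1
    return results
```

The caller can then decide what to do with partial success, for example marking the build unstable instead of failed when only one host is down.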

We REALLY shouldn't be using the default docker from apt.

We shouldn't be running docker like this on the host.
