
Date:
Authors:
Status: In progress/Complete
Summary: Failures across the infrastructure caused issues with multiple services.
Impact: Development and other ETL runs may or may not have run.

Root Cause

To be investigated.

Trigger

To be investigated.

Resolution

05:29 - Restarted the VM. Looks like it's back.

05:32 - Restarted Intrepid ETLs that failed.

05:34 - Started recovery of 10.150.0.137/8's docker.

05:41 - ETLs started earlier failed due to the docker issue.

To be discovered.

Detection

Failures on a number of services over the weekend:

intrepid-extract-clean - #56 Failure after 20 min (Open)

site-dev - #524 Failure after 22 min (Open)

devops - #2258 Failure after 4 min 6 sec (Open)

service-registry - #95429 Failure after 2.8 sec (Open)


sshd not responding on 10.140.0.136
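
A check like the one below could flag an unresponsive sshd earlier than a failed build does. This is only a minimal sketch, not something currently deployed: the function name `sshd_responding` is hypothetical, and it assumes the server sends its identification string (starting with `SSH-`) as the first bytes after the connection opens, which is how OpenSSH behaves.

```python
import socket


def sshd_responding(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if host:port accepts a TCP connection and sends an SSH banner.

    A bare TCP connect is not enough: a wedged sshd can still accept
    connections, so we also wait for the "SSH-" identification string.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(64)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    return banner.startswith(b"SSH-")
```

Calling `sshd_responding("10.140.0.136")` from a monitoring job would return False while the host is in the state described above.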

Action Items

Action Item | Type | Owner | Issue
Use docker from Docker's own apt repository rather than the Ubuntu-packaged one (docker-ce rather than docker.io) | mitigate/prevent | |

Timeline

Supporting Information

e.g. monitoring dashboards

The devops pipeline (https://build.tis.nhs.uk/jenkins/blue/organizations/jenkins/devops/detail/devops/2258/pipeline) shouldn't use rsync to sync with both machines in one step: if one machine goes down, the sync to the other fails as well.
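
One way to decouple the two machines is to sync to each destination independently and report per-destination results, so that one host being down does not block the other. The sketch below is an illustration only and is not what the pipeline currently does; `run_rsync`, `sync_to_all` and the `-az` flags are assumptions chosen for the example.

```python
import subprocess
from typing import Callable, Dict, Sequence


def run_rsync(src: str, dest: str) -> int:
    """Run rsync once and return its exit code (0 means success)."""
    return subprocess.run(["rsync", "-az", src, dest]).returncode


def sync_to_all(
    src: str,
    dests: Sequence[str],
    runner: Callable[[str, str], int] = run_rsync,
) -> Dict[str, bool]:
    """Sync src to every destination, continuing past individual failures.

    Returns a mapping of destination -> True if that sync succeeded.
    """
    results: Dict[str, bool] = {}
    for dest in dests:
        try:
            results[dest] = runner(src, dest) == 0
        except OSError:
            # e.g. rsync binary missing; record the failure and carry on.
            results[dest] = False
    return results
```

The caller can then decide whether a partial success should fail the build or only raise a warning, for example by checking `all(results.values())`.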

We REALLY shouldn't be using the default docker from Ubuntu's apt repositories (docker.io).
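
To audit which hosts still run the Ubuntu package, something like the following could be run on each machine. It is a rough sketch under assumptions: the helper names are hypothetical, and it relies on `dpkg-query --showformat` printing one `<package> <status>` line per known package, with `install ok installed` as the status of an installed package.

```python
import subprocess
from typing import Optional

DOCKER_PACKAGES = ("docker-ce", "docker.io")


def installed_docker_package(dpkg_output: str) -> Optional[str]:
    """Return the first fully installed docker package named in dpkg_output.

    Expects lines of the form "<package> <want> <error> <state>".
    Returns None if neither docker-ce nor docker.io is installed.
    """
    for line in dpkg_output.splitlines():
        name, _, status = line.partition(" ")
        if name in DOCKER_PACKAGES and status.strip() == "install ok installed":
            return name
    return None


def query_dpkg() -> str:
    """Ask dpkg about both packages; unknown packages only produce stderr noise."""
    proc = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Package} ${Status}\\n", *DOCKER_PACKAGES],
        capture_output=True,
        text=True,
    )
    return proc.stdout
```

On a host, `installed_docker_package(query_dpkg())` returning `"docker.io"` would mean the action item above still applies there.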


We shouldn't be running docker like this on the host.
