Date
 
Authors
Status: In progress/Complete
Summary: Failures across infrastructure causing issues with multiple services.
Impact: Development and other ETL runs may or may not have run.

Root Cause

To be investigated.

Trigger

To be investigated.

Resolution

05:29 - Restarted the VM. Looks like it's back.

05:32 - Restarted Intrepid ETLs that failed.

05:34 - Started recovery of 10.150.0.137/8's docker.

To be discovered.

Detection

Failures on a number of services over the weekend:

intrepid-extract-clean - #56 Failure after 20 min (Open)

site-dev - #524 Failure after 22 min (Open)

devops - #2258 Failure after 4 min 6 sec (Open)

service-registry - #95429 Failure after 2.8 sec (Open)


sshd not responding on 10.140.0.136
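
A check like the one below could catch an unresponsive sshd before the build jobs start failing. It is only a minimal sketch: the function name `sshd_responds` and the timeout values are assumptions for illustration, not part of any existing monitoring here. It opens a TCP connection and waits for the identification string that an SSH server sends first; it ignores the case where a server sends other lines before that string.

```python
import socket


def sshd_responds(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if the host sends an SSH identification string in time.

    Hypothetical helper: a refused connection, a timeout, or any other
    socket error counts as "not responding".
    """
    try:
        # The timeout applies to the connect and to the recv that follows.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            banner = conn.recv(256)
    except OSError:
        return False
    # An SSH server normally opens with a line such as "SSH-2.0-...".
    return banner.startswith(b"SSH-")
```

For example, `sshd_responds("10.140.0.136")` would return False while the host is in the state described above, which a scheduled job could turn into an alert.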

Action Items

Action Item | Type | Owner | Issue

mitigate/prevent

Timeline

Supporting Information

e.g. monitoring dashboards

https://build.tis.nhs.uk/jenkins/blue/organizations/jenkins/devops/detail/devops/2258/pipeline
The devops job shouldn't use rsync to sync with both machines; also, if one goes down, the other fails.
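
One way to stop a single dead machine from failing the whole sync step is to run the transfer per host and collect the results instead of aborting on the first error. The sketch below is an assumption about how that could look, not the current devops job: `sync_all`, `build_rsync_command`, the paths and the host names are all hypothetical, and it still uses rsync only to keep the illustration close to the existing setup.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def build_rsync_command(src: str, host: str, dest: str) -> List[str]:
    """Build one rsync invocation for one target host."""
    # -a: archive mode, -z: compress, --timeout: give up on stalled I/O.
    return ["rsync", "-az", "--timeout=60", src, f"{host}:{dest}"]


def sync_all(
    hosts: Sequence[str],
    src: str,
    dest: str,
    run: Callable[..., object] = subprocess.run,
) -> Dict[str, bool]:
    """Sync to every host independently and report success per host."""
    results: Dict[str, bool] = {}
    for host in hosts:
        cmd = build_rsync_command(src, host, dest)
        try:
            completed = run(cmd, check=False)
            results[host] = completed.returncode == 0
        except OSError:
            # rsync missing or not executable: record it and carry on.
            results[host] = False
    return results
```

A caller could then fail the build only when every host failed, or mark it unstable when some did, for example `failed = [h for h, ok in sync_all(hosts, "./build/", "/srv/app/").items() if not ok]`.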

We REALLY shouldn't be using the default Docker package from apt.
