Date | |
Authors | |
Status | In progress/Complete |
Summary | Failures across infrastructure causing issues with multiple services. |
Impact | Development and other ETL runs may or may not have run. |
Root Cause
To be investigated.
Trigger
To be investigated.
Resolution
05:29 - Restarted the VM. Looks like it's back.
05:32 - Restarted Intrepid ETLs that failed.
05:34 - Started recovery of 10.150.0.137/8's docker.
Further resolution steps: to be discovered.
Detection
Failures on a number of services over the weekend:
intrepid-extract-clean - #56 Failure after 20 min (Open)
site-dev - #524 Failure after 22 min (Open)
devops - #2258 Failure after 4 min 6 sec (Open)
service-registry - #95429 Failure after 2.8 sec (Open)
sshd not responding on 10.140.0.136
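An unresponsive sshd like the one noted above could be picked up by a simple reachability probe rather than by failed builds. The following is a minimal sketch only, not something currently in place: the function name `port_is_open`, the default port 22, and the 3-second timeout are assumptions made for illustration.

```python
import socket


def port_is_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only shows that something accepts connections
    on the port, not that sshd can complete a login.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("10.140.0.136")` would return `False` while the host refuses or drops connections on port 22. A successful TCP connect is a weaker signal than a real SSH handshake, so this is best treated as a first-line check.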
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
mitigate/prevent | |||
Timeline
Supporting Information
e.g. monitoring dashboards
https://build.tis.nhs.uk/jenkins/blue/organizations/jenkins/devops/detail/devops/2258/pipeline
The devops pipeline shouldn't use rsync to sync with both machines; also, if one machine goes down, the sync to the other fails.
We really shouldn't be using the default docker from apt.
We shouldn't be running docker like this on the host.
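One way to stop a single unreachable machine from failing the whole sync is to handle each host independently and report failures at the end. This is a rough sketch under assumptions, not the pipeline's actual code: the function names, the `./build/` source and `/opt/app/` destination paths, and the 300-second timeout are all hypothetical. Only the standard `rsync -az` invocation is taken as given.

```python
import subprocess
from typing import Callable, Dict, Iterable


def rsync_to_host(host: str, src: str = "./build/", dest: str = "/opt/app/") -> None:
    """Copy src to a single host. Paths are placeholders (assumptions).

    Raises CalledProcessError on a non-zero rsync exit status and
    TimeoutExpired if the transfer exceeds the timeout.
    """
    subprocess.run(["rsync", "-az", src, f"{host}:{dest}"], check=True, timeout=300)


def sync_all(
    hosts: Iterable[str],
    sync_one: Callable[[str], None] = rsync_to_host,
) -> Dict[str, str]:
    """Sync every host in turn and return {host: error message} for failures.

    A failure on one host is recorded and the loop moves on, so the
    remaining hosts are still synced.
    """
    failures: Dict[str, str] = {}
    for host in hosts:
        try:
            sync_one(host)
        except Exception as exc:  # keep going whatever went wrong for this host
            failures[host] = str(exc)
    return failures
```

The caller can then decide whether a non-empty result should fail the build or only raise a warning, which keeps that policy separate from the transfer itself.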