Date | |
Authors | |
Status | In progress/Complete |
Summary | Failures across infrastructure causing issues with multiple services. |
Impact | Development and other ETL runs may or may not have run. |
Root Cause
To be investigated.
Trigger
To be investigated.
Resolution
05:29 - Restarted the VM. Looks like it's back.
05:32 - Restarted Intrepid ETLs that failed.
To be discovered.
Detection
Failures on a number of services over the weekend:
intrepid-extract-clean - #56 Failure after 20 min (Open)
site-dev - #524 Failure after 22 min (Open)
devops - #2258 Failure after 4 min 6 sec (Open)
service-registry - #95429 Failure after 2.8 sec (Open)
sshd not responding on 10.140.0.136
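A check along the following lines could surface an unresponsive sshd earlier than a failed build does. This is a minimal sketch, not what was run during the incident: the function name `sshd_responds` and the timeout values are assumptions, and it only tests that the host accepts a TCP connection on port 22 and opens with an SSH identification string.

```python
import socket


def sshd_responds(host, port=22, timeout=5.0):
    """Return True if the host accepts a TCP connection and sends an SSH banner.

    Hypothetical helper for illustration; it does not authenticate or run
    any command, it only looks at the first bytes the server sends.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(256)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    # SSH servers normally open with an identification line such as "SSH-2.0-...".
    return banner.startswith(b"SSH-")


# Example call (address taken from the detection notes above):
#   sshd_responds("10.140.0.136")
```

A host that accepts the connection but never sends a banner is reported as not responding, which matches the hung-daemon case more closely than a plain port check would.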
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
mitigate/prevent | | | |
Timeline
Supporting Information
e.g. monitoring dashboards
https://build.tis.nhs.uk/jenkins/blue/organizations/jenkins/devops/detail/devops/2258/pipeline
The devops pipeline shouldn't use rsync to sync with both machines; in addition, if one machine goes down, the other fails as well.
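One possible way to stop a single unreachable machine from failing the whole sync step is to sync to each host independently and report the failures at the end. The sketch below is an assumption about how that could look, not the pipeline's actual implementation: the function name `sync_to_hosts`, the host names, and the paths are hypothetical, and it keeps rsync only to show the per-host isolation.

```python
import subprocess


def sync_to_hosts(source, hosts, dest, run=subprocess.run):
    """Sync `source` to every host independently; return the hosts that failed.

    Hypothetical helper: `run` is injectable so the logic can be exercised
    without a real rsync binary or network access.
    """
    failed = []
    for host in hosts:
        # -a: archive mode, -z: compress, --timeout: give up on stalled I/O.
        cmd = ["rsync", "-az", "--timeout=30", source, f"{host}:{dest}"]
        try:
            result = run(cmd, check=False)
            ok = result.returncode == 0
        except OSError:
            # For example, rsync is not installed on the build agent.
            ok = False
        if not ok:
            # Record the failure and carry on with the remaining hosts.
            failed.append(host)
    return failed


# Example call with placeholder host names and paths:
#   failed = sync_to_hosts("build/", ["host-a", "host-b"], "/srv/app/")
```

The caller can then decide whether a partial sync should mark the build as unstable rather than failed.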