Date | 13 May 2021 |
Authors | |
Status | Documenting |
Summary | |
Impact | NDW was a day out of date for some records. Revalidation and GMC Connections data was also out of date. |
Non-technical Description
A number of nightly jobs didn’t run because the application that runs them died. We restarted the application and manually triggered the jobs that are usually run on a schedule.
Trigger
Jenkins died: Service Unavailable / not functioning properly
Detection
Build Server: Alerting via Slack and team messages (Scrum Master)
Reval: User message on MS Teams
Resolution
Restarted Service
Reran NDW and GMC jobs
Timeline
???
: 15:16 BST - Jenkins raising exceptions
: 00:17 BST - Jenkins stopped logging
: 06:55 BST - Super-Scrum master flagged not only downtime but also additional consequences
There was an unrelated failure mentioned (STAGE PersonSync job)
: 07:25 BST - Jenkins restarted
: 07:30 BST - NDW jobs restarted
: 08:24 BST - Question about Revalidation jobs raised on MS Teams
: 08:25 BST - NDW jobs finished
: 08:50 BST - gmc-sync jobs rerun
: 09:27 BST - confirmed with reval users data had been refreshed
: 10:27 BST - Downstream NDW ETLs finished
Root Cause(s)
NDW & Reval jobs didn’t run
Jenkins was unresponsive; service was up but not doing anything
Possibly not having write permissions?
Server Memory usage
Possibly still more digging to do (Reuben Roberts)
Lots of builds were running (Reuben Roberts is going to look at dependabot builds)
Builds were taking longer than they should have, or not finishing at all
Action Items
Action Items | Owner | Ticket ref |
---|---|---|
Establish how to manage dependabot PRs: This might be best to cover in the Dev Handbook. | ||
Change the dependabot config: don’t auto-rebase (Existing dependabot Tech Improvement) | ||
Move nightly jobs away from Jenkins/build server, e.g. to ECS | ||
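For the "don’t auto-rebase" action item, a minimal sketch of what the change could look like in a repository's `.github/dependabot.yml`, assuming the repositories use GitHub-native Dependabot (version 2 config). The ecosystem, directory and schedule shown here are illustrative assumptions, not taken from the actual TIS repositories; only the `rebase-strategy` line is the change under discussion.

```yaml
# Hypothetical example config; adjust ecosystem/directory/schedule per repository.
version: 2
updates:
  - package-ecosystem: "gradle"   # assumption: the real ecosystem may differ
    directory: "/"                # assumption: manifest lives at the repo root
    schedule:
      interval: "weekly"          # assumption: illustrative schedule only
    # Stop Dependabot rebasing its open PRs automatically, so a merge to the
    # default branch does not trigger a fresh build for every open PR.
    rebase-strategy: "disabled"
```

With automatic rebasing disabled, a PR can still be rebased on demand by commenting `@dependabot rebase` on it.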
Lessons Learned
We’ve done a more complete RCA compared to when a similar outage happened: https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/1936687204/2020-08-03+TIS+NDW+ETLs+didn+t+run