Date | 13 May 2021 |
Authors | |
Status | Done |
Summary | |
Impact | NDW was a day out of date for some records. Revalidation and GMC Connections data were out of date. |
---|
Non-technical Description
A number of nightly jobs didn’t run because the application that runs them died. We restarted the application and manually triggered the jobs that are usually run on a schedule.
...
Trigger
Jenkins died: Service Unavailable / not functioning properly
...
Detection
Build Server: Alerting via Slack and team messages (Scrum Master)
...
Reval: User message on MS Teams
...
Resolution
Restarted Service
Reran NDW and GMC jobs
...
Timeline
???
: 15:16 BST - Jenkins raising exceptions
: 00:17 BST - Jenkins stopped logging
: 06:55 BST - Super-Scrum master flagged not only downtime but also additional consequences
There was an unrelated failure mentioned (STAGE PersonSync job)
: 07:25 BST - Jenkins restarted
: 07:30 BST - NDW jobs restarted
: 08:24 BST - Question about Revalidation jobs raised on MS Teams
: 08:25 BST - NDW jobs finished
: 08:50 BST - gmc-sync jobs rerun
: 09:27 BST - Confirmed with Reval users that data had been refreshed
: 10:27 BST - Downstream NDW ETLs finished
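The timeline is recorded in BST, while the Jenkins log excerpts quoted under Root Cause(s) are stamped `+0000` (UTC). A minimal sketch of the conversion used to line the two up is shown here; the helper name `to_bst` is hypothetical, and a fixed UTC+1 offset is assumed, which only holds during British Summer Time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST is modelled as a fixed UTC+1 offset, valid for dates in May.
BST = timezone(timedelta(hours=1), "BST")


def to_bst(utc_stamp: str) -> str:
    """Convert a Jenkins-style '+0000' timestamp into an HH:MM BST label."""
    parsed = datetime.strptime(utc_stamp, "%Y-%m-%d %H:%M:%S.%f%z")
    return parsed.astimezone(BST).strftime("%H:%M BST")


# Timestamp taken from the Jenkins log excerpt in this report.
print(to_bst("2021-05-13 06:26:29.918+0000"))
```

The converted value falls one minute after the "07:25 BST - Jenkins restarted" entry, which suggests the log excerpt was written during startup after the restart.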
Root Cause(s)
NDW & Reval jobs didn’t run
Jenkins was unresponsive; service was up but not doing anything
Possibly not having write permissions?
Server Memory usage
Possibly still more digging to do (Reuben Roberts)
Jenkins had crashed due to an out-of-memory error. From the syslog:
May 12 23:18:25 HEE-TIS-VM-JENKINS kernel: [1052434.449590] Out of memory: Kill process 1641 (java) score 49 or sacrifice child
May 12 23:18:25 HEE-TIS-VM-JENKINS kernel: [1052434.454439] Killed process 1641 (java) total-vm:10346296kB, anon-rss:2037444kB, file-rss:0kB
...
May 12 23:19:20 HEE-TIS-VM-JENKINS jenkins: jenkins: fatal: client (pid 1641) killed by signal 9, exiting
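To spot this kind of failure sooner, the syslog could be scanned for OOM-killer entries. The following is a rough sketch, not something the team ran at the time: the function name `find_oom_kills` is hypothetical, and the pattern is written only against the "Killed process" line format quoted above, which may differ between kernel versions.

```python
import re

# Pattern for the kernel "Killed process" line, based on the excerpt above.
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return one dict per process reported as killed by the OOM killer."""
    kills = []
    for line in lines:
        match = KILLED.search(line)
        if match:
            kills.append({
                "pid": int(match["pid"]),
                "name": match["name"],
                "total_vm_kb": int(match["total_vm_kb"]),
                "anon_rss_kb": int(match["anon_rss_kb"]),
            })
    return kills


# Sample line copied from the syslog excerpt in this report.
sample = [
    "May 12 23:18:25 HEE-TIS-VM-JENKINS kernel: [1052434.454439] "
    "Killed process 1641 (java) total-vm:10346296kB, "
    "anon-rss:2037444kB, file-rss:0kB",
]
for kill in find_oom_kills(sample):
    print(f"{kill['name']} (pid {kill['pid']}): {kill['anon_rss_kb']} kB resident")
```

In practice the lines would come from the syslog file itself rather than a hard-coded sample.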
When it restarted, it had lost write-permission on some of the data folders. From the jenkins log:
2021-05-13 06:26:29.918+0000 [id=22] INFO jenkins.model.Jenkins#<init>: deleting obsolete workspace /home/jenkins/data/jenkins/workspace/E_TIS-EsrDataExportService_PR-64
2021-05-13 06:26:30.562+0000 [id=22] WARNING jenkins.model.Jenkins#<init>: Exception in onOnline() for the computer listener class jenkins.branch.WorkspaceLocatorImpl$Collector on the Jenkins master node
Also: java.nio.file.FileSystemException: /home/jenkins/data/jenkins/workspace/E_TIS-EsrDataExportService_PR-64/build/reports/tests/test/packages/com.hee.tis.esr.esrdataexport.integration.notification.html: Operation not permitted
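A quick way to gauge how widespread the permission problem is would be to walk the workspace and list paths the Jenkins user cannot write to. This is a sketch under assumptions: the function name `unwritable_paths` is hypothetical, the workspace root is taken from the log lines above, and `os.access` only checks ordinary permission bits for the current user, so it may miss other causes of "Operation not permitted" such as ownership or filesystem attributes.

```python
import os

# Workspace root as it appears in the Jenkins log excerpt above.
WORKSPACE = "/home/jenkins/data/jenkins/workspace"


def unwritable_paths(root):
    """Return directories and files under root that the current user cannot write to."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        found.extend(path for path in candidates if not os.access(path, os.W_OK))
    return found


# Prints nothing if the directory is missing or everything is writable.
for path in unwritable_paths(WORKSPACE):
    print(path)
```

It would need to be run as the same user that runs Jenkins for the result to be meaningful.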
Lots of builds were running (Reuben Roberts is going to look at dependabot builds)
Builds were taking longer than they should have, or not finishing at all
Action Items
Action Items | Owner | Ticket ref |
---|---|---|
Establish how to manage dependabot PRs. This might be best covered in the Dev Handbook. | | |
Change the dependabot config: don’t auto-rebase (existing dependabot Tech Improvement) | | |
Move nightly jobs away from Jenkins/build server, i.e. to ECS | | https://hee-tis.atlassian.net/browse/TIS21-1318 will probably encompass https://hee-tis.atlassian.net/browse/TIS21-1587 |
[Unrelated] Check Elasticsearch snapshots are configured as needed: maintenance windows should avoid the Sync job on Stage and Prod (small subtask) | | |
...
Lessons Learned
We carried out a more complete RCA than when a similar outage happened: https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/1936687204/2020-08-03+TIS+NDW+ETLs+didn+t+run