
Date

13 May 2021

Authors

Joseph (Pepe) Kelly

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-1578

Impact

NDW was a day out of date for some records. Revalidation and GMC Connections data were also out of date.

Non-technical Description

A number of nightly jobs didn’t run because the application that runs them died. We restarted the application and manually triggered the jobs that are usually run on a schedule.


Trigger

  • Jenkins died: Service Unavailable / not functioning properly


Detection

Build Server: Alerts via Slack and team messages (Scrum Master)

Reval: User message on MS Teams
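
Both detections above came from people noticing and passing messages on. As a rough illustration only, a scripted probe could catch a Jenkins that is up but not answering. The sketch below assumes the Jenkins login page answers unauthenticated requests with HTTP 200; the function name `is_responsive`, the timeout and the injectable `opener` parameter are hypothetical choices for this example, not part of any existing TIS tooling.

```python
import urllib.request
from typing import Callable


def is_responsive(
    url: str,
    timeout_s: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True only if the URL answers with HTTP 200 within the timeout.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout_s) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses,
        # so refused connections, 5xx responses and hangs all land here.
        return False
```

A scheduler running outside the build server could call `is_responsive("https://jenkins.example.org/login")` (a placeholder URL) and raise an alert when it returns False.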


Resolution

  • Restarted Service

  • Reran NDW and GMC jobs


Timeline

  • ???

  • 00:17 BST - Jenkins stopped logging

  • 06:55 BST - Super-Scrum master flagged not only downtime but also additional consequences

    • There was an unrelated failure mentioned (STAGE PersonSync job)

  • 07:25 BST - Jenkins restarted

  • 07:30 BST - NDW jobs restarted

  • 08:24 BST - Question about Revalidation jobs raised on MS Teams

  • 08:25 BST - NDW jobs finished

  • 08:50 BST - gmc-sync jobs rerun

  • 09:27 BST - Confirmed with reval users that data had been refreshed

  • 10:27 BST - Downstream NDW ETLs finished
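
The log excerpts under Root Cause(s) are not in BST: the Jenkins log carries an explicit +0000 offset, and the syslog lines carry no offset or year at all. A minimal sketch for lining them up with this timeline, assuming the syslog is also in UTC and dated 2021 (neither is stated in the log itself); the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# British Summer Time is UTC+1; a fixed offset avoids needing a tz database.
BST = timezone(timedelta(hours=1), "BST")


def jenkins_log_to_bst(stamp: str) -> datetime:
    """Convert a Jenkins log stamp like '2021-05-13 06:26:29.918+0000' to BST."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f%z").astimezone(BST)


def syslog_to_bst(stamp: str, year: int) -> datetime:
    """Convert a syslog stamp like 'May 12 23:18:25' to BST.

    ASSUMPTION: the syslog clock is UTC; the year must be supplied by the caller.
    """
    parsed = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).astimezone(BST)


print(jenkins_log_to_bst("2021-05-13 06:26:29.918+0000").strftime("%d %b %H:%M %Z"))
print(syslog_to_bst("May 12 23:18:25", 2021).strftime("%d %b %H:%M %Z"))
```

Under those assumptions the kernel kill at 23:18 on 12 May falls at 00:18 BST on 13 May, close to the 00:17 entry above, and the 06:26 UTC Jenkins start-up messages fall at 07:26 BST, close to the 07:25 restart.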

Root Cause(s)

  • NDW & Reval jobs didn’t run

  • Jenkins was unresponsive; service was up but not doing anything

  • Jenkins had crashed because the server ran out of memory and the kernel killed the Jenkins java process. From the syslog:

    • May 12 23:18:25 HEE-TIS-VM-JENKINS kernel: [1052434.449590] Out of memory: Kill process 1641 (java) score 49 or sacrifice child
      May 12 23:18:25 HEE-TIS-VM-JENKINS kernel: [1052434.454439] Killed process 1641 (java) total-vm:10346296kB, anon-rss:2037444kB, file-rss:0kB
      ...
      May 12 23:19:20 HEE-TIS-VM-JENKINS jenkins: jenkins: fatal: client (pid 1641) killed by signal 9, exiting

  • When Jenkins restarted, it no longer had write permission on some of its data folders. From the Jenkins log:

    • 2021-05-13 06:26:29.918+0000 [id=22] INFO jenkins.model.Jenkins#<init>: deleting obsolete workspace /home/jenkins/data/jenkins/workspace/E_TIS-EsrDataExportService_PR-64
      2021-05-13 06:26:30.562+0000 [id=22] WARNING jenkins.model.Jenkins#<init>: Exception in onOnline() for the computer listener class jenkins.branch.WorkspaceLocatorImpl$Collector on the Jenkins master node
      Also: java.nio.file.FileSystemException: /home/jenkins/data/jenkins/workspace/E_TIS-EsrDataExportService_PR-64/build/reports/tests/test/packages/com.hee.tis.esr.esrdataexport.integration.notification.html: Operation not permitted
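
The syslog excerpt above records which process the kernel killed and how much memory it held. As a hedged sketch, the "Killed process" line can be pulled apart with a regular expression; the pattern is written against the line format quoted here and may not match other kernel versions, and the names `OomKill` and `parse_oom_kill` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Matches the kernel's "Killed process" line in the format quoted above.
KILLED = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB"
)


class OomKill(NamedTuple):
    pid: int
    name: str
    total_vm_kb: int
    anon_rss_kb: int


def parse_oom_kill(line: str) -> Optional[OomKill]:
    """Return the killed process details, or None if the line is not an OOM kill."""
    match = KILLED.search(line)
    if match is None:
        return None
    return OomKill(
        pid=int(match["pid"]),
        name=match["name"],
        total_vm_kb=int(match["total_vm"]),
        anon_rss_kb=int(match["anon_rss"]),
    )


line = (
    "May 12 23:18:25 HEE-TIS-VM-JENKINS kernel: [1052434.454439] "
    "Killed process 1641 (java) total-vm:10346296kB, anon-rss:2037444kB, file-rss:0kB"
)
kill = parse_oom_kill(line)
print(kill)
```

For the line quoted above, `anon_rss_kb` is 2037444, i.e. roughly 1.9 GiB resident when the process was killed.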


Action Items

Action Items

Owner

Ticket ref

Establish how to manage dependabot PRs:

  • what manual tests/checks?

  • should we routinely rebase?

  • what do we do with failing PR builds?

This might be best to cover in the Dev Handbook.

Change the dependabot config: don’t auto-rebase (Existing dependabot Tech Improvement)

  • ESR to start with.

Move nightly jobs off the Jenkins build server, e.g. to ECS


Lessons Learned
