Date

13 May 2021

Authors

Joseph (Pepe) Kelly

Status

In Progress

Summary

https://hee-tis.atlassian.net/browse/TIS21-1578

Impact

NDW was a day out of date for some records. Revalidation and GMC Connections data were also out of date.

Non-technical Description

A number of nightly jobs didn’t run because the application that runs them died. We restarted the application and manually triggered the jobs that are usually run on a schedule.

...

Trigger

  • Jenkins died: Service Unavailable / not functioning properly

...

Detection

Build Server: Alerting via Slack and team messages (Scrum Master)

...

Reval: User message on MS Teams

...

Resolution

  • Restarted Service

  • Reran NDW and GMC jobs

...

Timeline

  • ???

  • : 15:16 BST - Jenkins raising exceptions

  • : 00:17 BST - Jenkins stopped logging

  • : 06:55 BST - Super-Scrum master flagged both the downtime and its additional consequences

    • There was an unrelated failure mentioned (STAGE PersonSync job)

  • : 07:25 BST - Jenkins restarted

  • : 07:30 BST - NDW jobs restarted

  • : 08:24 BST - Question about Revalidation jobs raised on MS Teams

  • : 08:25 BST - NDW jobs finished

  • : ??:?? BST - gmc-sync jobs rerun

  • : ??:?? BST - confirmed with reval users that data had been refreshed

  • : 10:27 BST - Downstream NDW ETLs finished

Root Cause(s)

  • NDW & Reval jobs didn’t run

    • Jenkins was unresponsive; service was up but not doing anything

      • Possibly not having write permissions?

      • Server Memory usage

      • Lots of builds were running (Reuben Roberts is going to look at dependabot builds)

        • Builds were taking longer than they should have, or not finishing

...

Action Items

Action Items

Owner

Ticket ref

Establish how to manage dependabot PRs:

  • what manual tests/checks?

  • should we routinely rebase?

  • what do we do with failing PR builds?

This might be covered best in the Dev Handbook.

Change the dependabot config: don’t auto-rebase (Existing dependabot Tech Improvement)

  • ESR to start with.
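
As a rough illustration of the config change above, a minimal sketch of a `.github/dependabot.yml` with automatic rebasing turned off. The `rebase-strategy` option is part of Dependabot's version 2 config. The package ecosystem, directory and schedule shown here are assumptions and would need to match each repository's existing file.

```yaml
# .github/dependabot.yml (sketch only; adapt to the repository's existing config)
version: 2
updates:
  - package-ecosystem: "maven"   # assumption: replace with the ecosystem the repo actually uses
    directory: "/"               # assumption: location of the build manifest
    schedule:
      interval: "daily"          # assumption: keep whatever schedule is already configured
    # Stop Dependabot from rebasing its open PRs automatically,
    # so a merge to the default branch does not retrigger a build for every open PR.
    rebase-strategy: "disabled"
```

With rebasing disabled, a PR can still be rebased on demand by commenting `@dependabot rebase` on it.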

Move nightly jobs away from the Jenkins build server, i.e. to ECS

...

Lessons Learned