
Date:

Authors: Joseph (Pepe) Kelly (plus those mentioned on the page)

Status: Verifying in Stage & Prod?

Summary

Dependency change. The overnight jobs failed to run, so some information was several days “stale” in TIS and downstream systems (the NDW).

Impact

For several days, some data reflected the state as of 21st Jan.

Non-technical Description

The overnight sync procedure for TIS was unable to run during the period to . As a result, automated updates to data and the person search page were not performed during that time, and some stale data was presented to users (of the person search function, for example).

The stale data was also transferred to the NDW. Once the sync jobs were run manually, all the data updates were completed successfully.


Trigger

  • A faulty upgrade to one of the project dependencies broke the scheduling.

Detection

  • A routine check at 07:40 on found that the usual TIS-SYNC-SERVICE Slack messages were absent from to .

  • Checks of logging information and Slack showed that the jobs did not run properly from 22nd Jan to 26th Jan.


Resolution

  • Re-ran the sync jobs and then the NDW ETL.

  • Rolled back the faulty dependency.


Timeline

  • - Dependabot Pull Request was merged

  • to - Nightly sync jobs failed

  • 07:40 BST - Missing sync job Slack messages were noticed

  • 08:00 BST - Investigation revealed that the triggering mechanism within the synchronisation service had failed

  • 08:01 BST - HEE sync jobs were started manually

  • 08:37 BST - NIMDTA sync jobs were started manually

  • 08:45 BST - NDW ETL job (production) was rerun manually

  • 08:50 BST - Users were informed that the jobs had completed and TIS was operating as normal

  • 09:17-09:34 BST - Breaking change was reverted

  • 09:30 BST - NDW ETL finished; NDW team informed


Root Cause(s)

  • No messages were received in Slack.

  • The absence of messages went unnoticed for several days.

  • The job couldn’t start, despite the cron schedule firing.

  • Diagnosis was hampered by the split between serverless runtime environments and VM environments.

  • No start message was received either, because messages are only generated from within the job itself (at its start and end).

  • A major version upgrade of a dependency was missing a necessary class at runtime, despite passing CI tests.

    • Even manual verification wouldn’t have picked this up.

  • Tests don’t cover the scheduling functionality.
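The “no start message” gap above can be closed by reporting from outside the job: the wrapper that launches the job posts the lifecycle messages, so a job that never starts still produces a “failed to start” alert. A minimal sketch, with hypothetical function and message names (not the actual TIS sync service code), and the notifier injected so a Slack webhook poster could be passed in:

```python
import subprocess

def run_with_notifications(cmd, notify):
    """Launch a job command, reporting lifecycle events via `notify`.

    `notify` is any callable taking a message string (e.g. a Slack
    webhook poster); it is injected so the wrapper is testable offline.
    Unlike a message sent from inside the job, the STARTING and
    FAILED-TO-START messages here are emitted even when the job's own
    code never runs (missing class, broken classpath, absent binary).
    """
    notify(f"STARTING: {cmd[0]}")
    try:
        result = subprocess.run(cmd)
    except OSError as exc:
        # The process never launched at all - exactly the failure mode
        # that an in-job message can never report.
        notify(f"FAILED-TO-START: {cmd[0]}: {exc}")
        return 127
    if result.returncode == 0:
        notify(f"COMPLETED: {cmd[0]}")
    else:
        notify(f"ERRORED: {cmd[0]} exit={result.returncode}")
    return result.returncode
```

With a wrapper like this in the cron entry, a broken dependency that prevents the job from starting leaves an explicit Slack trail rather than silence.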


Action Items

  • Introduce testing of the scheduled components / tests that verify the job runs.

  • Add a manual step to run jobs from a one-off cron expression (only if automated tests can’t be done).

Owner: Jayanta Saha

A test-enhancement ticket has been created: https://hee-tis.atlassian.net/browse/TIS21-2624

Look at how scheduling is done across all the TIS services, possibly:

  • Use an external scheduler / verifier (e.g. CloudWatch Events).

  • Send a “start/failed to start” Slack message earlier and a “completed/errored” Slack message at a later point, with specific codes, to pick up more exceptions.

Owner: Reuben Roberts

Investigation ticket created: https://hee-tis.atlassian.net/browse/TIS21-2623

Review responsibilities around checking jobs/Slack, e.g.:

  • Sharing what people look out for.

  • Reminding the team of norms/expectations about checking application health.

  • How to quickly find what is running where.

  • Having named people per week to check.

Owners: Marcello Fabbri (Unlicensed), Yafang Deng, Reuben Roberts, Jayanta Saha

Has the daily check for “completed” messages stopped running?

Owner: Reuben Roberts

This Ansible tool is probably not worth resuscitating, as it was apparently not very polished and would need to be extended to cover missed messaging. Discussions with John Simmons (Deactivated) led to this ticket: https://hee-tis.atlassian.net/browse/TIS21-2621
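Whatever tool performs it, the daily check can be reduced to a small scan of recent channel messages: every expected job must have a “completed” message inside the window, and anything missing triggers an alert. A sketch under assumed job names and message wording (the real TIS-SYNC-SERVICE message formats may differ):

```python
from datetime import datetime, timedelta

# Hypothetical job identifiers - stand-ins for the real sync job names.
EXPECTED_JOBS = ["hee-sync", "nimdta-sync"]

def missing_completions(messages, now, window=timedelta(hours=24)):
    """Return the expected jobs with no 'completed' message in the window.

    `messages` is an iterable of (timestamp, text) pairs, as might be
    read back from the Slack channel history. An empty result means the
    overnight jobs all reported in; a non-empty one is the alert the
    team went several days without.
    """
    cutoff = now - window
    seen = {
        job
        for ts, text in messages
        if ts >= cutoff
        for job in EXPECTED_JOBS
        if job in text.lower() and "completed" in text.lower()
    }
    return [job for job in EXPECTED_JOBS if job not in seen]
```

Run daily (e.g. from an external scheduler), this would turn “the lack of messages wasn’t picked up for several days” into a same-morning alert.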

  • Move all logging / ship all logs to CloudWatch.

  • Have a documented place for where everything runs (e.g. handbook, infrastructure diagrams).

  • Tidy up definitions for ECS clusters (services with instance count = 0).

Owner: Marcello Fabbri (Unlicensed)


Lessons Learned
