2022-01-26 Overnight jobs failed to run


Date

Jan 26, 2022

Authors

@Joseph (Pepe) Kelly (plus those mentioned on the page)

Status

Done

Summary

A dependency change broke the job scheduling. The overnight jobs failed to run, so some information was several days “stale” in TIS and in downstream systems (NDW).

Impact

For several days, some data reflected the state as of 21st Jan.

Non-technical Description

The overnight sync procedure for TIS was unable to run during the period Jan 22, 2022 to Jan 26, 2022. As such, automated updates to data and the person search page were not performed during that time, resulting in some stale data being presented to users (of the person search function, for example).

The stale data was also transferred to the NDW. Once the sync jobs were run manually, all the data updates were completed successfully.

 


Trigger

  • A faulty upgrade to one of the project dependencies broke the scheduling.

Detection

  • Routine check at 07:40 on Jan 26, 2022 found the absence of normal TIS-SYNC-SERVICE Slack messages from Jan 22, 2022 to Jan 26, 2022.

  • Checks of logging information and Slack showed that the jobs did not run properly from 22nd Jan to 26th Jan.


Resolution

  • Re-ran sync jobs and then NDW ETL

  • The faulty dependency was rolled back.


Timeline

  • Jan 21, 2022 - Dependabot Pull Request was merged

  • Jan 22, 2022 to Jan 26, 2022 - Nightly sync jobs fail

  • Jan 26, 2022 07:40 BST - The missing Sync job Slack messages were noted

  • Jan 26, 2022 08:00 BST - Investigation revealed that the triggering mechanism within a synchronization service failed.

  • Jan 26, 2022 08:01 BST - The HEE Sync jobs were started manually

  • Jan 26, 2022 08:37 BST - The NIMDTA Sync jobs were started manually

  • Jan 26, 2022 08:45 BST - The NDW-ETL job (production) was rerun manually

  • Jan 26, 2022 08:50 BST - Users informed that jobs had completed and TIS operating as normal

  • Jan 26, 2022 09:17-09:34 BST - Breaking change reverted

  • Jan 26, 2022 09:30 BST - NDW ETL finishes. NDW team informed.


Root Cause(s)

  • No messages were received in Slack

  • Lack of messages wasn’t picked up for several days

  • Job couldn’t start, despite the cron schedule firing

  • Diagnosis hampered by the split between Serverless runtime environments and VM environments

  • We didn’t even get a start message as it is only generated from within the job (at the start and end)

  • After a major version upgrade of a dependency, a necessary class was missing at runtime, even though the CI tests passed.

    • Even manual verifications wouldn’t pick this up

  • Tests don’t cover the scheduling functionality


Action Items

Action Items

Owner


  • Introduce testing of the scheduled components / Tests that verify the job runs

  • Add a manual step to run jobs from a one-off cron expression (only if automated tests can’t be done)

@Jayanta Saha

Ticket created for the test enhancements:

https://hee-tis.atlassian.net/browse/TIS21-2624
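
As an illustration of what a test that verifies "the job runs" could look like, here is a minimal sketch in Python. It is not the TIS sync service's actual scheduling code: `schedule_nightly`, `FakeClock` and `job_fires_when_due` are hypothetical names, and the standard-library `sched` module stands in for whatever scheduler the service uses. The point is that the test drives the scheduler itself, rather than calling the job directly, so a trigger that cannot start the job would fail the test.

```python
import sched


class FakeClock:
    """Manually advanced clock so the test does not wait in real time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        # The scheduler calls this to wait; we just jump the clock forward.
        self.now += seconds


def schedule_nightly(scheduler, job, delay_seconds):
    """Hypothetical wiring under test: register `job` to fire after a delay."""
    return scheduler.enter(delay_seconds, 1, job)


def job_fires_when_due(delay_seconds=3600):
    """Return True if the scheduled trigger really invokes the job."""
    clock = FakeClock()
    scheduler = sched.scheduler(clock.time, clock.sleep)
    calls = []
    schedule_nightly(scheduler, lambda: calls.append(clock.time()), delay_seconds)
    scheduler.run()  # returns once the event queue is empty
    # The job must have been called exactly once, at the scheduled time.
    return calls == [float(delay_seconds)]
```

A test suite would then assert `job_fires_when_due()`. The fake clock keeps the test fast while still exercising the path from schedule to job invocation.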

Look at how we do scheduling across all the TIS stuff, possibly:

  • Use an external scheduler / verifier (e.g. CloudWatch Events)

  • Send a “start/failed to start” Slack message earlier and a “completed/errored” Slack message at a later point to pick up more exceptions, with specific codes.

@Reuben Roberts
Investigation ticket created: https://hee-tis.atlassian.net/browse/TIS21-2623
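
The idea of sending the “start/failed to start” message from outside the job can be sketched as follows. This is a hedged illustration in Python, not the service's implementation: `run_with_notifications`, `load_job` and `notify` are hypothetical names, and the message strings (`FAILED_TO_START`, `ERRORED`) are example codes only. The wrapper reports before the job is resolved, so a failure to load the job (such as a missing class or module) still produces a message.

```python
def run_with_notifications(load_job, notify):
    """Run a job, reporting start and outcome from outside the job itself.

    load_job: hypothetical callable that resolves and returns the job;
              this is where a missing class or module would surface.
    notify:   hypothetical callable that posts a message, e.g. to Slack.
    """
    notify("trigger fired")
    try:
        job = load_job()
    except Exception as exc:  # includes ImportError for a missing dependency
        notify(f"FAILED_TO_START: {type(exc).__name__}: {exc}")
        return False
    notify("started")
    try:
        job()
    except Exception as exc:
        notify(f"ERRORED: {type(exc).__name__}: {exc}")
        return False
    notify("completed")
    return True
```

For example, passing a `load_job` that raises `ImportError("missing class")` and a list's `append` as `notify` leaves the list holding the "trigger fired" message followed by a `FAILED_TO_START` message, rather than nothing at all.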

 Review responsibilities around checking jobs/slack, e.g.:

  • Sharing what people look out for

  • Reminding team of norms / expectations about checking application health

  • How to quickly find what is running where?

  • Have named people per week to check?

 @Marcello Fabbri (Unlicensed) @Yafang Deng @Reuben Roberts @Jayanta Saha

 Has the daily check for “completed” messages stopped running?

 @Reuben Roberts

This Ansible tool is probably not worth resuscitating, as it was apparently not very polished and would need to be extended to cover missed messaging.
Discussions with @John Simmons (Deactivated) led to this ticket: https://hee-tis.atlassian.net/browse/TIS21-2621
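
For reference, a check for missed “completed” messages could be as small as the sketch below. It is an assumption-laden illustration in Python: `missing_run_dates` is a hypothetical helper, and how the message dates would be collected (Slack, logs or elsewhere) is deliberately left out.

```python
from datetime import date, timedelta


def missing_run_dates(completed_dates, first_day, last_day):
    """Return each date in [first_day, last_day] with no "completed" message."""
    seen = set(completed_dates)
    days = (last_day - first_day).days + 1
    expected = (first_day + timedelta(days=offset) for offset in range(days))
    return [day for day in expected if day not in seen]
```

Given "completed" messages only for Jan 20 and Jan 21, 2022 and a window of Jan 20 to Jan 26, the function returns the five dates from Jan 22 to Jan 26, which is the gap seen in this incident. Running such a check daily would flag the first missed night instead of the fifth.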

Move all logging / ship all logs to CloudWatch

 

Have a documented place for where everything runs, e.g. handbook, Infra diagrams?

 

Tidy up definitions for ECS clusters (services with instance count = 0)

@Marcello Fabbri (Unlicensed)


Lessons Learned

  • Group review for RCA and identifying action items from the root causes is very useful.
