2022-01-26 Overnight jobs failed to run
Date | Jan 26, 2022 |
Authors | @Joseph (Pepe) Kelly (plus those mentioned on the page) |
Status | Done |
Summary | A dependency change broke the scheduling. The overnight jobs failed to run, so some information was several days “stale” in TIS and downstream systems (NDW). |
Impact | For several days, some data reflected the state as of 21st Jan. |
Non-technical Description
The overnight sync procedure for TIS was unable to run during the period Jan 22, 2022 to Jan 26, 2022. As such, automated updates to data and the person search page were not performed during that time, resulting in some stale data being presented to users (of the person search function, for example).
The stale data was also transferred to the NDW. Once the sync jobs were run manually, all the data updates were completed successfully.
Trigger
A faulty upgrade to one of the project dependencies broke the scheduling.
Detection
Routine check at 07:40 on Jan 26, 2022 found the absence of normal TIS-SYNC-SERVICE Slack messages from Jan 22, 2022 to Jan 26, 2022.
Checks of logging information and Slack show that the jobs did not run properly from 22nd Jan to 26th Jan.
Resolution
Re-ran sync jobs and then NDW ETL
The faulty dependency was rolled back.
Timeline
Jan 21, 2022 - Dependabot Pull Request was merged
Jan 22, 2022 to Jan 26, 2022 - Nightly sync jobs fail
Jan 26, 2022 07:40 GMT - The missing Sync job Slack messages were noted
Jan 26, 2022 08:00 GMT - Investigation revealed that the triggering mechanism within a synchronization service failed.
Jan 26, 2022 08:01 GMT - The HEE Sync jobs were started manually
Jan 26, 2022 08:37 GMT - The NIMDTA Sync jobs were started manually
Jan 26, 2022 08:45 GMT - The NDW-ETL job (production) was rerun manually
Jan 26, 2022 08:50 GMT - Users informed that jobs had completed and TIS operating as normal
Jan 26, 2022 09:17-09:34 GMT - Breaking change reverted
Jan 26, 2022 09:30 GMT - NDW ETL finishes. NDW team informed.
Root Cause(s)
No messages were received in Slack
Lack of messages wasn’t picked up for several days
Job couldn’t start, despite the cron schedule firing
Diagnosis hampered by the split between Serverless runtime environments and VM environments
We didn’t even get a start message as it is only generated from within the job (at the start and end)
A major version upgrade of a dependency left a necessary class missing at runtime, even though the CI tests passed.
Even manual verifications wouldn’t pick this up
Tests don’t cover the scheduling functionality
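One way to avoid the silent failure described above is to send notifications from the code that triggers the job rather than from inside the job itself. The following is a minimal Python sketch of that idea, not the sync service's actual implementation. The names `run_with_notifications`, `notify` and `broken_job` are hypothetical, and `notify` stands in for whatever posts to Slack.

```python
from typing import Callable


def run_with_notifications(
    job_name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run a job, reporting start, success and failure from outside the job body.

    Because the messages are sent by the wrapper, a job that cannot even
    start (for example, because a dependency is missing at runtime) still
    produces a "trigger fired" message followed by a failure message.
    """
    notify(f"{job_name}: trigger fired")
    try:
        job()
    except Exception as exc:  # report any failure, including failure to start
        notify(f"{job_name}: FAILED ({type(exc).__name__}: {exc})")
        return False
    notify(f"{job_name}: completed")
    return True


def broken_job() -> None:
    """Hypothetical job that fails immediately, as if a dependency were missing."""
    raise ImportError("required class not found at runtime")
```

With this shape, a scheduled run that cannot start would still leave two messages in the channel, so the failure would be visible on the first night rather than being inferred from silence.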
Action Items
Action Items | Owner |
---|---|
| @Jayanta Saha (a ticket for enhancement of tests has been created) |
Look at how we do scheduling across all the TIS stuff, possibly: | @Reuben Roberts |
Review responsibilities around checking jobs/Slack, e.g.: | @Marcello Fabbri (Unlicensed) @Yafang Deng @Reuben Roberts @Jayanta Saha |
Has the daily check for “completed” messages stopped running? | @Reuben Roberts This Ansible tool is probably not worth resuscitating, as it was apparently not very polished, and would need to be extended to cover missed messaging. |
Move all logging / ship all logs to CloudWatch | |
Have a documented place for where everything runs, e.g. handbook, infra diagrams? | |
Tidy up definitions for ECS clusters (services with instance count = 0) | @Marcello Fabbri (Unlicensed) |
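A daily check for missing “completed” messages, as discussed in the action items, could take roughly the following shape. This is a hedged Python sketch under stated assumptions: the function name `find_stale_jobs`, the job names and the 26-hour threshold are all hypothetical, and the step of collecting the last completion time per job (from Slack or from logs) is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stale_jobs(
    expected_jobs: List[str],
    last_completed: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=26),  # assumed: nightly job plus some slack
) -> List[str]:
    """Return the jobs with no completion message within max_age of now.

    A job that has never reported completion is also treated as stale,
    so that silence is flagged rather than ignored.
    """
    stale = []
    for job in expected_jobs:
        finished = last_completed.get(job)
        if finished is None or now - finished > max_age:
            stale.append(job)
    return stale


# Illustrative data only: job names and timestamps are made up for the example.
check_time = datetime(2022, 1, 26, 7, 40)
seen = {"hee-sync": datetime(2022, 1, 21, 2, 0)}
stale_jobs = find_stale_jobs(["hee-sync", "nimdta-sync"], seen, check_time)
```

Treating a missing entry the same as an old one is the key design choice here: the incident went unnoticed because the absence of messages raised no alert.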
Lessons Learned
Group review for RCA and identifying action items from the root causes is very useful.
Related pages
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213