Date | |
Authors | Joseph (Pepe) Kelly (plus those mentioned on the page) |
Status | Verifying in Stage & Prod? |
Summary | Dependency change? The overnight jobs failed to run, so some information was several days “stale” in TIS and downstream systems (NDW). |
Impact | For several days, some data reflected the state as of 21st Jan. |
Non-technical Description
The overnight sync procedure for TIS was unable to run during the period to . As such, automated updates to data and the person search page were not performed during that time, resulting in some stale data being presented to users (of the person search function, for example).
The stale data was also transferred to the NDW. Once the sync jobs were run manually, all the data updates were completed successfully.
Trigger
A faulty upgrade to one of the project dependencies broke the scheduling.
Detection
Routine check at 07:40 on found the absence of normal TIS-SYNC-SERVICE Slack messages from to .
Checks of logging information and Slack show that the jobs did not run properly from 22nd Jan to 26th Jan.
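The check described above (working out which nights have no completion message) could be automated along these lines. This is a minimal sketch, not the team's tooling: the function name `missing_run_dates` and the dates in the usage example are hypothetical placeholders, and it assumes the dates of "completed" messages have already been extracted from Slack or the logs by some other means.

```python
from datetime import date, timedelta
from typing import Iterable, List


def missing_run_dates(first_day: date, last_day: date,
                      completed_on: Iterable[date]) -> List[date]:
    """Return every day in [first_day, last_day] with no completion message."""
    seen = set(completed_on)
    days = (last_day - first_day).days + 1
    return [first_day + timedelta(days=i) for i in range(days)
            if first_day + timedelta(days=i) not in seen]


if __name__ == "__main__":
    # Placeholder dates for illustration only, not the incident's real dates.
    completed = [date(2000, 1, 1), date(2000, 1, 2), date(2000, 1, 6)]
    gaps = missing_run_dates(date(2000, 1, 1), date(2000, 1, 6), completed)
    for day in gaps:
        print(f"No completion message for {day.isoformat()}")
```

Running a check like this daily would flag a gap after the first missed night rather than relying on someone noticing the silence.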
Resolution
Re-ran the sync jobs and then the NDW ETL.
The faulty dependency was rolled back.
Timeline
- Dependabot Pull Request was merged
to Nightly sync jobs fail
07:40 BST - The missing Sync job Slack messages were noted
08:00 BST - Investigation revealed that the triggering mechanism within a synchronization service failed.
08:01 BST - The HEE Sync jobs were started manually
08:37 BST - The NIMDTA Sync jobs were started manually
08:45 BST - The NDW-ETL job (production) was rerun manually
08:50 BST - Users informed that jobs had completed and TIS operating as normal
09:17-09:34 BST - Breaking change reverted
09:30 BST - NDW ETL finishes. NDW team informed.
Root Cause(s)
No messages were received in Slack
Lack of messages wasn’t picked up for several days
Job couldn’t start, despite the cron schedule firing
Diagnosis hampered by the split between Serverless runtime environments and VM environments
We didn’t even get a start message as it is only generated from within the job (at the start and end)
The major version upgrade of a dependency left a necessary class missing at runtime, despite passing CI tests.
Even manual verifications wouldn’t pick this up
Tests don’t cover the scheduling functionality
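One way to address the point that no start message was sent, because messages are only generated inside the job, is to report from the code that triggers the job instead. The sketch below illustrates the idea in Python under stated assumptions: `run_with_notifications`, `load_job` and `notify` are hypothetical names, `notify` stands in for whatever posts to Slack, and the real services may be structured quite differently.

```python
from typing import Callable, List


def run_with_notifications(job_name: str,
                           load_job: Callable[[], Callable[[], None]],
                           notify: Callable[[str], None]) -> bool:
    """Run a job, reporting from outside it so start-up failures are visible."""
    notify(f"{job_name}: trigger fired")
    try:
        # Loading may fail if a dependency is missing at runtime.
        job = load_job()
    except Exception as exc:
        notify(f"{job_name}: FAILED TO START ({type(exc).__name__}: {exc})")
        return False
    try:
        job()
    except Exception as exc:
        notify(f"{job_name}: FAILED ({type(exc).__name__}: {exc})")
        return False
    notify(f"{job_name}: completed")
    return True


if __name__ == "__main__":
    def broken_loader() -> Callable[[], None]:
        # Simulates a dependency that is missing only at runtime.
        raise ImportError("hypothetical.missing.module")

    messages: List[str] = []
    run_with_notifications("example-sync-job", broken_loader, messages.append)
    print("\n".join(messages))
```

With this shape, a job that cannot be constructed still produces a "trigger fired" message followed by a failure message, so silence is no longer the only signal. The same wrapper can be called from a test with a stub `notify`, which gives some coverage of the trigger path that the existing tests lack.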
Action Items
Action Items | Owner |
---|---|
| A ticket to enhance the tests has been created here |
Look at how we do scheduling across all the TIS stuff, possibly:
| Reuben Roberts |
Review responsibilities around checking jobs/slack, e.g.:
| Marcello Fabbri (Unlicensed) Yafang Deng Reuben Roberts Jayanta Saha |
Has the daily check for “completed” messages stopped running? | This Ansible tool is probably not worth resuscitating, as it was apparently not very polished and would need to be extended to cover missed messaging. |
| |
| |
Tidy up definitions for ECS clusters (services with instance count = 0) |