
Date:

Authors: Joseph (Pepe) Kelly (plus those mentioned on the page)

Status: Verifying in Stage & Prod?

Summary

Dependency change. The overnight jobs failed to run, so some information was several days “stale” in TIS and downstream systems (the NDW).

Impact

For several days, some data reflected the state as of 21st Jan.

Non-technical Description

The overnight sync procedure for TIS was unable to run during the period to . As a result, automated updates to data and the person search page were not performed during that time, and some stale data was presented to users (of the person search function, for example).

The stale data was also transferred to the NDW. Once the sync jobs were run manually, all the data updates were completed successfully.


Trigger

  • A faulty upgrade to one of the project dependencies broke the scheduling.

Detection

  • A routine check at 07:40 on found that the usual TIS-SYNC-SERVICE Slack messages were absent from to .

  • Checks of logging information and Slack showed that the jobs did not run properly from 22nd Jan to 26th Jan.


Resolution

  • Re-ran the sync jobs and then the NDW ETL.

  • Rolled back the faulty dependency.


Timeline

  • - Dependabot Pull Request was merged

  • to - Nightly sync jobs failed

  • 07:40 BST - Missing sync job Slack messages were noticed

  • 08:00 BST - Investigation revealed that the triggering mechanism within the synchronisation service had failed

  • 08:01 BST - HEE sync jobs were started manually

  • 08:37 BST - NIMDTA sync jobs were started manually

  • 08:45 BST - NDW ETL job (production) was rerun manually

  • 08:50 BST - Users were informed that the jobs had completed and TIS was operating as normal

  • 09:17-09:34 BST - Breaking change was reverted

  • 09:30 BST - NDW ETL finished; NDW team informed


Root Cause(s)

  • No messages were received in Slack.

  • The absence of messages went unnoticed for several days.

  • The job couldn’t start, despite the cron schedule firing.

  • Diagnosis was hampered by the split between serverless runtime environments and VM environments.

  • No start message was received either, because messages are only generated from within the job itself (at its start and end).

  • A major version upgrade of a dependency was missing a necessary class at runtime, despite passing CI tests.

    • Even manual verification wouldn’t have picked this up.

  • Tests don’t cover the scheduling functionality.
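The “no start message” gap above can be closed by reporting from outside the job: the wrapper that launches the job posts the lifecycle messages, so a job that never starts still produces a “failed to start” alert. A minimal sketch, with hypothetical function and message names (not the actual TIS sync service code), and the notifier injected so a Slack webhook poster could be passed in:

```python
import subprocess

def run_with_notifications(cmd, notify):
    """Launch a job command, reporting lifecycle events via `notify`.

    `notify` is any callable taking a message string (e.g. a Slack
    webhook poster); it is injected so the wrapper is testable offline.
    Unlike a message sent from inside the job, the STARTING and
    FAILED-TO-START messages here are emitted even when the job's own
    code never runs (missing class, broken classpath, absent binary).
    """
    notify(f"STARTING: {cmd[0]}")
    try:
        result = subprocess.run(cmd)
    except OSError as exc:
        # The process never launched at all - exactly the failure mode
        # that an in-job message can never report.
        notify(f"FAILED-TO-START: {cmd[0]}: {exc}")
        return 127
    if result.returncode == 0:
        notify(f"COMPLETED: {cmd[0]}")
    else:
        notify(f"ERRORED: {cmd[0]} exit={result.returncode}")
    return result.returncode
```

With a wrapper like this in the cron entry, a broken dependency that prevents the job from starting leaves an explicit Slack trail rather than silence.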


Action Items

  • Introduce testing of the scheduled components / tests that verify the job runs.

  • Add a manual step to run jobs from a one-off cron expression (only if automated tests can’t be done).

Owner: Jayanta Saha

A test-enhancement ticket has been created: https://hee-tis.atlassian.net/browse/TIS21-2624

Look at how scheduling is done across all the TIS services, possibly:

  • Use an external scheduler / verifier (e.g. CloudWatch Events).

  • Send a “start/failed to start” Slack message earlier and a “completed/errored” Slack message at a later point, with specific codes, to pick up more exceptions.

Owner: Reuben Roberts

Investigation ticket created: https://hee-tis.atlassian.net/browse/TIS21-2623

Review responsibilities around checking jobs/Slack, e.g.:

  • Sharing what people look out for.

  • Reminding the team of norms/expectations about checking application health.

  • How to quickly find what is running where.

  • Having named people per week to check.

Owners: Marcello Fabbri (Unlicensed), Yafang Deng, Reuben Roberts, Jayanta Saha

Has the daily check for “completed” messages stopped running?

Owner: Reuben Roberts

This Ansible tool is probably not worth resuscitating, as it was apparently not very polished and would need to be extended to cover missed messaging. Discussions with John Simmons (Deactivated) led to this ticket: https://hee-tis.atlassian.net/browse/TIS21-2621
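Whatever tool performs it, the daily check can be reduced to a small scan of recent channel messages: every expected job must have a “completed” message inside the window, and anything missing triggers an alert. A sketch under assumed job names and message wording (the real TIS-SYNC-SERVICE message formats may differ):

```python
from datetime import datetime, timedelta

# Hypothetical job identifiers - stand-ins for the real sync job names.
EXPECTED_JOBS = ["hee-sync", "nimdta-sync"]

def missing_completions(messages, now, window=timedelta(hours=24)):
    """Return the expected jobs with no 'completed' message in the window.

    `messages` is an iterable of (timestamp, text) pairs, as might be
    read back from the Slack channel history. An empty result means the
    overnight jobs all reported in; a non-empty one is the alert the
    team went several days without.
    """
    cutoff = now - window
    seen = {
        job
        for ts, text in messages
        if ts >= cutoff
        for job in EXPECTED_JOBS
        if job in text.lower() and "completed" in text.lower()
    }
    return [job for job in EXPECTED_JOBS if job not in seen]
```

Run daily (e.g. from an external scheduler), this would turn “the lack of messages wasn’t picked up for several days” into a same-morning alert.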

  • Move all logging / ship all logs to CloudWatch.

  • Have a documented place for where everything runs (e.g. handbook, infrastructure diagrams).

  • Tidy up definitions for ECS clusters (services with instance count = 0).

Owner: Marcello Fabbri (Unlicensed)


Lessons Learned
