Verifying in Stage & Prod?

Date

Authors

Joseph (Pepe) Kelly (plus those mentioned on the page)

Status

Done

Summary

A dependency change appears to have broken the scheduling. The overnight jobs failed to run, so some information was several days stale in TIS and downstream systems (NDW).

Impact

For several days, some data reflected the state as of 21st Jan.

...

Action Items

Owner

  • Introduce testing of the scheduled components / Tests that verify the job runs

  • Add a manual step to run jobs from a one-off cron expression (only if automated tests can’t be done)
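
As an illustration of a test that verifies a scheduled job actually runs, the sketch below drives Python's standard-library `sched` scheduler with a fake clock, so the test does not have to wait for real time to pass. `FakeClock` and `run_scheduled` are hypothetical names introduced here. The TIS services' real scheduling mechanism is not described on this page, so a real test would need to exercise that mechanism instead.

```python
import sched


class FakeClock:
    """Hypothetical test clock: 'sleeping' just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def run_scheduled(job, delay_seconds, clock):
    """Schedule `job` to run after `delay_seconds` and run the scheduler to completion."""
    scheduler = sched.scheduler(clock.time, clock.sleep)
    scheduler.enter(delay_seconds, 1, job)
    scheduler.run()


if __name__ == "__main__":
    clock = FakeClock()
    runs = []
    # Record the (fake) time at which the job was triggered.
    run_scheduled(lambda: runs.append(clock.time()), 3600, clock)
    print("job ran at:", runs)
```

A test along these lines fails if the scheduler never triggers the job, which is the class of failure described in the summary.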

Jayanta Saha

A ticket for enhancing the tests has been created:

https://hee-tis.atlassian.net/browse/TIS21-2624

Look at how we do scheduling across all TIS components, possibly:

  • Use an external scheduler / verifier (e.g. CloudWatch Events)

  • Send a “start/failed to start” Slack message earlier and a “completed/errored” Slack message later, with specific codes, to pick up more exceptions.
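
The start/completed messaging idea could be sketched roughly as follows. This is a minimal sketch and not the existing TIS implementation: `run_with_notifications`, the `notify` callable, and the message codes (`JOB_STARTED`, `JOB_COMPLETED`, `JOB_ERRORED`) are all assumptions made for illustration. In practice `notify` would post to Slack.

```python
from typing import Callable


def run_with_notifications(
    job_name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],
) -> None:
    """Run `job`, sending a coded message before it starts and after it finishes."""
    notify(f"[JOB_STARTED] {job_name}")
    try:
        job()
    except Exception as exc:
        # Report the failure with a specific code, then re-raise so the
        # caller (or scheduler) still sees the error.
        notify(f"[JOB_ERRORED] {job_name}: {exc}")
        raise
    notify(f"[JOB_COMPLETED] {job_name}")


if __name__ == "__main__":
    # Stand-in for a Slack client: just print the messages.
    run_with_notifications("overnight-sync", lambda: None, print)
```

A job that is never triggered cannot report its own failure to start, so the "failed to start" case would have to be detected by something outside the job, such as an external verifier noticing that no `JOB_STARTED` message arrived in the expected window.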

Reuben Roberts
Investigation ticket created: https://hee-tis.atlassian.net/browse/TIS21-2623

 Review responsibilities around checking jobs/slack, e.g.:

  • Sharing what people look out for

  • Reminding team of norms / expectations about checking application health

  • How to quickly find what is running where?

  • Have named people per week to check?

 Marcello Fabbri (Unlicensed) Yafang Deng Reuben Roberts Jayanta Saha

 Has the daily check for “completed” messages stopped running?

 Reuben Roberts

This Ansible tool is probably not worth resuscitating: it was apparently not very polished, and would need to be extended to cover missed messaging.
Discussions with John Simmons (Deactivated) led to this ticket: https://hee-tis.atlassian.net/browse/TIS21-2621

Move all logging / ship all logs to CloudWatch

Have a documented place for where everything runs, e.g. handbook, Infra diagrams?

Tidy up definitions for ECS clusters (services with instance count = 0)

Marcello Fabbri (Unlicensed)

...

Lessons Learned

  •   Group review for RCA and identifying action items from the root causes is very useful.