...

Action Items

Owner

  • Introduce tests for the scheduled components that verify each job runs

  • Add a manual step to run jobs from a one-off cron expression (only if automated tests can’t be done)

Jayanta Saha
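
For the manual fallback, a one-off cron expression could be generated rather than typed by hand. This is a minimal sketch only: the function name `one_off_cron` and the 5-minute default are hypothetical, and it assumes the standard five-field cron format (some schedulers add a leading seconds field, so the output may need adapting).

```python
from datetime import datetime, timedelta


def one_off_cron(now: datetime, delay_minutes: int = 5) -> str:
    """Return a five-field cron expression for a single minute shortly after `now`.

    Hypothetical helper for illustration; not part of any existing TIS tooling.
    """
    run_at = now + timedelta(minutes=delay_minutes)
    # Field order: minute hour day-of-month month day-of-week
    return f"{run_at.minute} {run_at.hour} {run_at.day} {run_at.month} *"


if __name__ == "__main__":
    print(one_off_cron(datetime.now()))
```

An expression like this matches the same minute every year, so it would need to be removed again once the manual run has been checked.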

Review how scheduling is done across all of TIS, possibly:

  • Use an external scheduler / verifier (e.g. CloudWatch Events)

  • Send a “start/failed to start” Slack message early and a “completed/errored” Slack message later, with specific codes, to pick up more exceptions.

Reuben Roberts
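
The early/late messaging idea above could look roughly like the sketch below. Everything in it is an assumption for illustration: the status codes, the function name `run_with_notifications`, and the `notify` callable (which in practice might post to a Slack webhook) are not taken from any existing TIS code.

```python
from typing import Callable

# Hypothetical status codes; the action item does not define any.
STARTED, COMPLETED, ERRORED = "JOB-START", "JOB-OK", "JOB-ERR"


def run_with_notifications(
    job_name: str,
    job: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run `job`, reporting the start and the outcome through `notify`.

    Returns True if the job completed, False if it raised an exception.
    """
    # Sent before any work is done, so a job that never starts leaves no message.
    notify(f"[{STARTED}] {job_name} started")
    try:
        job()
    except Exception as exc:
        notify(f"[{ERRORED}] {job_name} errored: {type(exc).__name__}: {exc}")
        return False
    notify(f"[{COMPLETED}] {job_name} completed")
    return True


if __name__ == "__main__":
    run_with_notifications("example-job", lambda: None, print)
```

A missing start message then means the job failed to start, and a start message with no later outcome means it stalled; an external verifier would be needed to detect either case.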

 Review responsibilities around checking jobs/Slack, e.g.:

  • Sharing what people look out for

  • Reminding the team of norms / expectations about checking application health

  • How to quickly find what is running where?

  • Have named people per week to check?

 Marcello Fabbri (Unlicensed), Yafang Deng, Reuben Roberts, Jayanta Saha

 Has the daily check for “completed” messages stopped running?

 Reuben Roberts

This Ansible tool is probably not worth resuscitating, as it was apparently not very polished and would need to be extended to cover missed messaging.
Discussions with John Simmons (Deactivated) led to this ticket: https://hee-tis.atlassian.net/browse/TIS21-2621

Move all logging / ship all logs to CloudWatch

Have a documented place for where everything runs, e.g. handbook, Infra diagrams?

Tidy up definitions for ECS clusters (services with instance count = 0)

Marcello Fabbri (Unlicensed)
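
As a starting point for the ECS tidy-up, services scaled to zero could be listed from service descriptions. This is a sketch under assumptions: `idle_services` is a hypothetical helper, and the input is assumed to be a list of dicts with `serviceName` and `desiredCount` keys, as found in the `services` list of an ECS DescribeServices response.

```python
from typing import Any, Dict, List


def idle_services(services: List[Dict[str, Any]]) -> List[str]:
    """Return the names of services whose desired count is zero, sorted by name.

    Hypothetical helper; expects dicts with `serviceName` and `desiredCount` keys.
    """
    return sorted(s["serviceName"] for s in services if s["desiredCount"] == 0)


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = [
        {"serviceName": "example-a", "desiredCount": 0},
        {"serviceName": "example-b", "desiredCount": 2},
    ]
    print(idle_services(sample))
```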

...