Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Date

Authors

James Harris Yafang Deng Jayanta Saha

Status

Summary

ndw-etl-prod task was not run on Apr 11

Impact

Caused a failure of the User Refresh job on NDW.

Non-technical Description

The push of TIS data to NDW failed on Apr 11.

...

Trigger

...

Detection

  • Email from NDW team in the morning of 11/04/2023

  • In #monitoring-ndw channel on Slack, no notifications found for ndw-etl-prod task:

    Image RemovedImage Added

...

Resolution

  • Contacted GMC support and technical contact at the GMC

  • Resolved by GMC

...

Timeline

BST unless otherwise stated

  • 04:03 - Since then, no more notifications had been received for the overnight ndw etl jobs.

  • 12:07 - James redirected the email that push of TIS data into NDW failed

  • 12:40 - Yafang found no logs found on ndw-etl-prod in the midnight

  • 13:02 - Andy D found the event bridge triggered the task and the service considered itself healthy, but no record for the task on prod

  • 14:26 - Yafang & Jay triggered the task ndw-etl-prod manually

  • 14:56 - The task has been run and exited on etl-prod successfully

  • 15:11 - James let Guy know and NDW started their jobs.

...

Root Cause(s)

  • We expect the ndw-etl-prod job to be triggered by the AWS eventbridge rule every day at 2am UTC.

  • From the metrics, the everntbridge rule was triggered on Apr 11, but there’re no logs found on Cloudwatch. And from the ECSStoppedTasksEvent, we can also find the ndw-etl-prod task was not started.

...

Action Items

Action Items

Comments

Owner

Lessons Learned