
Date

Authors

James Harris Yafang Deng Jayanta Saha

Status

Summary

The ndw-etl-prod task did not run on Apr 11.

Impact

Caused a failure of the User Refresh job on NDW.

Non-technical Description

The push of TIS data to NDW failed on Apr 11.


Trigger


Detection

  • Email from the NDW team on the morning of 11/04/2023

  • No notifications for the ndw-etl-prod task were found in the #monitoring-ndw channel on Slack


Resolution

  • Contacted GMC support and technical contact at the GMC

  • Resolved by GMC


Timeline

BST unless otherwise stated

  • 04:03 - No further notifications were received for the overnight NDW ETL jobs after this time

  • 12:07 - James forwarded the email reporting that the push of TIS data into NDW had failed

  • 12:40 - Yafang found no overnight logs for ndw-etl-prod

  • 13:02 - Andy D found that the EventBridge rule had triggered the task and the service considered itself healthy, but there was no record of the task on prod

  • 14:26 - Yafang & Jay triggered the ndw-etl-prod task manually

  • 14:56 - The task ran and exited successfully on etl-prod

  • 15:11 - James let Guy know and NDW started their jobs.


Root Cause(s)

  • We expect the ndw-etl-prod job to be triggered by the AWS EventBridge rule every day at 2am UTC (03:00 BST).

  • The metrics show that the EventBridge rule was triggered on Apr 11, but no logs were found in CloudWatch. The ECSStoppedTasksEvent also shows that the ndw-etl-prod task was never started.
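The gap described above (rule triggered, task never started) could be caught by a check along the following lines. This is only a sketch, not the team's monitoring code: the function names, the 30-minute grace period, and the 10 and 12 April start times are assumptions made up for illustration. Only the 2am UTC schedule and the manual run at 14:26 BST (13:26 UTC) on 11 April come from this report.

```python
from datetime import date, datetime, time, timedelta, timezone

# From the report: the EventBridge rule fires daily at 2am UTC.
SCHEDULED_TIME_UTC = time(hour=2)
# Assumption: how late a start may be and still count as the scheduled run.
GRACE_PERIOD = timedelta(minutes=30)


def task_started_on(day, task_start_times):
    """Return True if any task start falls in the window after the scheduled trigger."""
    window_start = datetime.combine(day, SCHEDULED_TIME_UTC, tzinfo=timezone.utc)
    window_end = window_start + GRACE_PERIOD
    return any(window_start <= started <= window_end for started in task_start_times)


def missed_days(days, task_start_times):
    """Return the days on which no task start matched the schedule."""
    return [day for day in days if not task_started_on(day, task_start_times)]


# Illustrative start times in UTC. The 10 and 12 April entries are invented;
# the 11 April entry is the manual run from the timeline (14:26 BST).
starts = [
    datetime(2023, 4, 10, 2, 0, 40, tzinfo=timezone.utc),
    datetime(2023, 4, 11, 13, 26, 0, tzinfo=timezone.utc),
    datetime(2023, 4, 12, 2, 0, 35, tzinfo=timezone.utc),
]
days = [date(2023, 4, 10), date(2023, 4, 11), date(2023, 4, 12)]

missed = missed_days(days, starts)
print(missed)
```

With this data the check flags 11 April only: the manual run falls outside the scheduled window, so it does not mask the missed overnight run. In practice the start times would have to come from whatever records task starts (for example ECS task state change events), which this sketch leaves out.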


Action Items

Action Items | Comments | Owner

Lessons Learned
