
Date

Authors

James Harris Yafang Deng Jayanta Saha

Status

TIS21-4367

Summary

The ndw-etl-prod task did not run on Apr 11.

Impact

Caused a failure of the User Refresh job on NDW.

Non-technical Description

The push of TIS data to NDW was not run on Apr 11.


Trigger

  • Nightly data push from TIS to NDW


Detection

  • Email from the NDW team on the morning of 11/04/2023

  • In the #monitoring-ndw channel on Slack, no notifications were found for the ndw-etl-prod task


Resolution

  • Manually ran the ndw-etl-prod task on ECS.

  • The start and finish of the task were reported in Slack.


Timeline

BST unless otherwise stated

  • 04:03 - No further notifications were received for the overnight NDW ETL jobs after this time

  • 12:07 - James forwarded the email reporting that the push of TIS data into NDW had failed

  • 12:40 - Yafang found no overnight logs for ndw-etl-prod

  • 13:02 - Andy D found that EventBridge had triggered the task and the service considered itself healthy, but there was no record of the task on prod

  • 14:26 - Yafang & Jay manually triggered the ndw-etl-prod task

  • 14:56 - The task ran on etl-prod and exited successfully

  • 15:11 - James let Guy know and NDW started their jobs.


Root Cause(s)

  • We expect the ndw-etl-prod job to be triggered by the AWS EventBridge rule every day at 2am UTC.

  • The metrics show that the EventBridge rule was triggered on Apr 11, but no logs were found in CloudWatch. The ECSStoppedTasksEvent records also show that the ndw-etl-prod task was never started.

  • The CloudTrail event history shows the reason for the failure: "Capacity is unavailable at this time. Please try again later or in a different availability zone"
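
The CloudTrail check described above could be scripted roughly as follows. This is a minimal sketch, not the procedure the team actually used: the function name failed_run_task_reasons is hypothetical, and it assumes the input is a list of items shaped like those returned by CloudTrail's LookupEvents API (each carrying the full record as a JSON string under "CloudTrailEvent"). It also assumes the failure text may appear either in the record's errorMessage field or in responseElements.failures[].reason, so it checks both.

```python
import json


def failed_run_task_reasons(events):
    """Return (event_time, reason) pairs for RunTask calls that reported a failure.

    `events` is assumed to be a list of dicts shaped like CloudTrail
    LookupEvents results, each with a "CloudTrailEvent" JSON string.
    """
    results = []
    for event in events:
        record = json.loads(event["CloudTrailEvent"])
        if record.get("eventName") != "RunTask":
            continue

        reasons = []
        # Assumption: API-level errors are reported in errorMessage.
        if record.get("errorMessage"):
            reasons.append(record["errorMessage"])
        # Assumption: per-task launch failures are listed under
        # responseElements.failures, each with a "reason" string.
        response = record.get("responseElements") or {}
        for failure in response.get("failures") or []:
            if failure.get("reason"):
                reasons.append(failure["reason"])

        for reason in reasons:
            results.append((record.get("eventTime"), reason))
    return results
```

The events could be fetched with boto3's CloudTrail client, for example by calling lookup_events with a LookupAttributes filter on EventName = RunTask and a time window around the scheduled run, then passing the returned "Events" list to this function. A RunTask record whose reason mentions unavailable capacity would match the failure seen in this incident.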


Action Items

Action Items | Comments | Owner

Lessons Learned
