Date

11 Apr 2023

Authors

Status

Jira: TIS21-4367

Summary

The ndw-etl-prod task did not run on 11 Apr 2023.

Impact

The User Refresh job on NDW failed.

End users would have seen stale data from 24 hours earlier.

Non-technical Description

The push of TIS data to NDW did not run on 11 Apr 2023.

...

Trigger

AWS had no capacity available at the scheduled time, so the task could not be started.

...

Detection

The NDW team emailed us on the morning of 11/04/2023.
The #monitoring-ndw channel on Slack showed no notifications for the ndw-etl-prod task.

...

Resolution

...

Contacted GMC support and technical contact at the GMC

...

Manually ran the `ndw-etl-prod` task on ECS.
Slack notifications confirmed the start and finish of the task.
...

Timeline

BST unless otherwise stated

11 Apr 2023 04:03 - No further notifications were received for the overnight NDW ETL jobs after this time.
11 Apr 2023 11:32 - NDW team emailed about the failure
11 Apr 2023 12:07 - James forwarded the email reporting that the push of TIS data into NDW had failed
11 Apr 2023 12:40 - Yafang found no overnight logs for ndw-etl-prod
11 Apr 2023 13:02 - Andy D found that the EventBridge rule had triggered the task and the service considered itself healthy, but there was no record of the task on prod
11 Apr 2023 14:26 - Yafang & Jay triggered the ndw-etl-prod task manually
11 Apr 2023 14:56 - The task ran and exited successfully on etl-prod
11 Apr 2023 15:11 - James let Guy know and NDW started their jobs.
12 Apr 2023 13:45 - Guy confirmed there were no subsequent issues.

...

Root Cause(s)

We expect the ndw-etl-prod job to be triggered by the AWS EventBridge rule every day at 2am UTC.
The metrics show that the EventBridge rule was triggered on 11 Apr, but no logs were found in CloudWatch. The ECSStoppedTasksEvent also shows that the ndw-etl-prod task was not started.
The CloudTrail event history shows the reason for the failure: "Capacity is unavailable at this time. Please try again later or in a different availability zone"
We could ask AWS why there was no capacity. The AWS Service Status indicated no failure or maintenance that would have reduced the expected capacity.
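As an illustration of how the CloudTrail finding above could be checked programmatically, here is a minimal Python sketch that inspects one CloudTrail record (already parsed into a dict) and returns the capacity message if present. It is not the team's tooling. The function name is hypothetical, and the assumption that the message may appear either in the top-level `errorMessage` field or under `responseElements.failures[].reason` should be verified against a real record from this incident.

```python
from typing import Optional

# Text quoted from the CloudTrail event history in this incident.
CAPACITY_MARKER = "Capacity is unavailable at this time"


def capacity_failure_reason(event: dict) -> Optional[str]:
    """Return the capacity-related failure text from a RunTask record, if any.

    Hypothetical helper: `event` is one CloudTrail record as a dict.
    """
    if event.get("eventName") != "RunTask":
        return None

    candidates = []

    # Top-level error text, present when the API call itself is rejected.
    if event.get("errorMessage"):
        candidates.append(event["errorMessage"])

    # Per-task failures; assumed to be recorded under responseElements.failures.
    response = event.get("responseElements") or {}
    for failure in response.get("failures") or []:
        if failure.get("reason"):
            candidates.append(failure["reason"])

    for text in candidates:
        if CAPACITY_MARKER in text:
            return text
    return None
```

A record for any other API call, or a RunTask record without the capacity message, yields `None`, so the same check could be applied to every record in an event history export.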

...

Action Items

Action Items

Comments

Owner

Monitor CloudTrail events. If this error occurs, pick another time to restart the task.

Alert to Slack channel if runTask fails.

Yafang Deng

Jira: TIS21-4383

Change config to use a wider choice of availability zones.
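The Slack alert in the first action item could look roughly like the Python sketch below. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names, the message wording, and the webhook URL are hypothetical. How the failure is detected (for example, from a CloudTrail event) is left out.

```python
import json
import urllib.request


def build_alert(task_name: str, reason: str) -> dict:
    """Build a Slack incoming-webhook payload for a failed RunTask call."""
    return {"text": f"RunTask failed for `{task_name}`: {reason}"}


def post_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status.

    `webhook_url` is a placeholder; the real URL would come from configuration.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the payload builder separate from the HTTP call means the message format can be tested without contacting Slack.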

Other things we won’t do, at least for now, given the assumed low probability of recurrence:

Automated retries
A separate job to check that the ETL has run for each environment
