Date | |
Authors | |
Status | In Progress |
Summary | The Bulk upload page is continually refreshing and showing “the server took too long to respond”. |
Impact | Users cannot use Bulk upload as usual. When the bulk upload service is up for a short period, users are sometimes able to upload a file. If the service then restarts, it can miss the responses from other services, leaving the file stalled in progress. |
Non-technical Description
Bulk upload service
Trigger
Detection
User queries: Liam Lofthouse: Morning General - is there an issue with the bulk upload page? I seem…
posted in TIS Support Channel / General at 21 April 2023 09:28:59
In the #monitoring-ndw channel on Slack, no notifications were found for the ndw-etl-prod task.
Resolution
Manually ran the ndw-etl-prod task on ECS. The start and finish of the task were notified in Slack.
Timeline
BST unless otherwise stated
04:03 - From this time on, no more notifications were received for the overnight NDW ETL jobs.
11:32 - NDW team emailed about the failure
12:07 - James forwarded the email reporting that the push of TIS data into NDW had failed
12:40 - Yafang found no logs for ndw-etl-prod for the overnight run
13:02 - Andy D found that the EventBridge rule had triggered the task and the service considered itself healthy, but there was no record of the task on prod
14:26 - Yafang & Jay triggered the ndw-etl-prod task manually
14:56 - The task ran and exited successfully on etl-prod
15:11 - James let Guy know and NDW started their jobs.
13:45 - Guy confirmed there were no subsequent issues.
Root Cause(s)
We expect the ndw-etl-prod job to be triggered by the AWS EventBridge rule every day at 2am UTC.
From the metrics, the EventBridge rule was triggered on Apr 11, but there are no logs in CloudWatch. The ECSStoppedTasksEvent also shows that the ndw-etl-prod task was not started.
The CloudTrail event history shows the reason for the failure:
"Capacity is unavailable at this time. Please try again later or in a different availability zone"
We could ask AWS why there was no capacity. The AWS Service Status indicates there was no failure or maintenance reducing expected capacity.
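To make the failure easier to spot next time, the check we did by hand in CloudTrail could be sketched roughly as below. This is only an illustration: the function name is hypothetical, and where exactly the message sits in the record (a top-level errorMessage field or a failures list under responseElements) is an assumption that should be confirmed against the real event before relying on it.

```python
from typing import Any, Dict, List

# Start of the message seen in the CloudTrail event history for this incident.
CAPACITY_MARKER = "Capacity is unavailable at this time"


def capacity_failure_reasons(event: Dict[str, Any]) -> List[str]:
    """Return capacity-related failure messages found in one CloudTrail record.

    Assumption: the message is either in a top-level "errorMessage" field or
    in "responseElements" -> "failures" -> "reason" of a RunTask record.
    """
    if event.get("eventName") != "RunTask":
        return []

    candidates: List[str] = []
    if event.get("errorMessage"):
        candidates.append(event["errorMessage"])

    # responseElements can be null in a record, so fall back to an empty dict.
    response = event.get("responseElements") or {}
    for failure in response.get("failures") or []:
        reason = failure.get("reason")
        if reason:
            candidates.append(reason)

    return [message for message in candidates if CAPACITY_MARKER in message]
```

An empty list means the record is either not a RunTask call or failed for some other reason, so it would need a separate look.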
Action Items
Action Items | Comments | Owner |
---|---|---|
Monitor CloudTrail events. If this error occurs, pick another time to restart the task. Alert to Slack channel if | ||
Change config to use a wider choice of availability zones. | ||
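The first action item (retry later on the capacity error, then alert) could look something like the sketch below. It is not an agreed design: run_with_retry, start_task and notify are hypothetical names, the number of attempts and the delay are placeholder values, and start_task is assumed to return None on success or the failure reason as a string. In practice start_task would wrap the call that starts ndw-etl-prod on ECS, ideally with subnets in more than one availability zone, and notify would post to the Slack channel.

```python
import time
from typing import Callable, Optional

# Start of the message seen in the CloudTrail event history for this incident.
CAPACITY_MARKER = "Capacity is unavailable at this time"


def run_with_retry(
    start_task: Callable[[], Optional[str]],
    notify: Callable[[str], None],
    attempts: int = 3,
    delay_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to start the task, waiting and retrying only on the capacity error.

    Returns True if the task was started, False if we gave up (after alerting).
    """
    for attempt in range(1, attempts + 1):
        reason = start_task()
        if reason is None:
            return True  # task started

        if CAPACITY_MARKER not in reason:
            # A different failure: retrying later is unlikely to help, so alert now.
            notify(f"ndw-etl-prod failed to start: {reason}")
            return False

        if attempt < attempts:
            sleep(delay_seconds)  # wait before trying again

    notify(f"ndw-etl-prod could not start after {attempts} attempts: capacity unavailable")
    return False
```

Passing sleep in as a parameter is only there so the logic can be exercised without real waiting.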
Other stuff we won’t do, at least for now, given the assumption about low probability of re-occurrence: