...

Non-technical Description

The bulk upload service is continually restarted and the bulk upload webpage continually refreshes.

...

Trigger

  • There were 2 very large files uploaded, which completed with thousands of errors.

...

Detection

  • User queries:

Liam Lofthouse: Morning General - is there an issue with the bulk upload page? I seem… 

posted in TIS Support Channel / General at 21 April 2023 09:28:59

 

...

In the #monitoring-ndw channel on Slack, no notifications were found for the ndw-etl-prod task:

...

Resolution

  • Manually ran the ndw-etl-prod task on ECS.

  • The start and finish of the task were notified in Slack.

...

Timeline

BST unless otherwise stated

  • 11 04:03 - Since then, no further notifications were received for the overnight NDW ETL jobs.

  • 11:32 - NDW team emailed about the failure

  • 12:07 - James forwarded the email reporting that the push of TIS data into NDW had failed

  • 12:40 - Yafang found no logs for ndw-etl-prod from the overnight run

  • 13:02 - Andy D found that the EventBridge rule had triggered the task and the service considered itself healthy, but there was no record of the task on prod

  • 14:26 - Yafang & Jay manually triggered the ndw-etl-prod task

  • 14:56 - The task ran and exited successfully on etl-prod

  • 15:11 - James let Guy know and NDW started their jobs.

  • 13:45 - Guy confirmed there were no subsequent issues.

Root Cause(s)

  • We expect the ndw-etl-prod job to be triggered by the AWS EventBridge rule every day at 2am UTC.

  • The metrics show that the EventBridge rule was triggered on Apr 11, but no logs were found in CloudWatch. The ECSStoppedTasksEvent also shows that the ndw-etl-prod task was not started.

  • The CloudTrail event history shows the reason for the failure: "Capacity is unavailable at this time. Please try again later or in a different availability zone"

  • We could ask AWS why there was no capacity. The AWS Service Status indicates there was no failure or maintenance that would have reduced expected capacity.
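
As an illustration of the check described above, the following is a minimal Python sketch that filters CloudTrail-style records for failed ECS RunTask calls and flags the capacity error. It works on plain dictionaries rather than calling AWS. The sample records are hypothetical, and where exactly the failure reason sits in the record (top-level errorMessage or responseElements.failures[].reason) is an assumption, so both are checked.

```python
from typing import Iterable

# Substring of the message quoted in the root cause above.
CAPACITY_MARKER = "Capacity is unavailable"


def failure_reasons(record: dict) -> list[str]:
    """Collect error text from one CloudTrail-style record.

    Assumption: the reason may sit either in the top-level errorMessage
    or in responseElements.failures[].reason, so both are inspected.
    """
    reasons = []
    if record.get("errorMessage"):
        reasons.append(record["errorMessage"])
    response = record.get("responseElements") or {}
    for failure in response.get("failures") or []:
        if failure.get("reason"):
            reasons.append(failure["reason"])
    return reasons


def failed_run_tasks(records: Iterable[dict]) -> list[dict]:
    """Return the ECS RunTask records that carry at least one failure reason."""
    return [
        r for r in records
        if r.get("eventSource") == "ecs.amazonaws.com"
        and r.get("eventName") == "RunTask"
        and failure_reasons(r)
    ]


def is_capacity_failure(record: dict) -> bool:
    """True if any failure reason in the record mentions unavailable capacity."""
    return any(CAPACITY_MARKER in reason for reason in failure_reasons(record))


# Hypothetical sample records, not taken from the real event history.
sample_records = [
    {"eventSource": "ecs.amazonaws.com", "eventName": "RunTask",
     "responseElements": {"failures": [], "tasks": [{"taskArn": "example"}]}},
    {"eventSource": "ecs.amazonaws.com", "eventName": "RunTask",
     "errorMessage": "Capacity is unavailable at this time. Please try again "
                     "later or in a different availability zone"},
]

failed = failed_run_tasks(sample_records)
print(len(failed), is_capacity_failure(failed[0]))
```

Matching on a message substring is brittle if AWS changes the wording, so a real check would probably also log any unrecognised failure reason.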

...

Action Items

Other stuff we won’t do, at least for now, given the assumption about low probability of re-occurrence:

  • Automated retries

  • A separate job to check that the ETL has run for each environment

    Action Items

    Comments

    Owner

    Monitor CloudTrail events. If this error occurs, pick another time to restart the task.

    Alert to Slack channel if runTask fails.

    Yafang Deng

    TIS21-4383

    Change config to use a wider choice of availability zones.
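
The "Alert to Slack channel if runTask fails" item could look roughly like the Python sketch below. It only builds the message and hands it to a caller-supplied `send` function, so it runs without network access. The function names and message wording are hypothetical. In practice `send` would presumably POST the payload as JSON to a Slack incoming webhook for the monitoring channel.

```python
from typing import Callable, Optional


def build_slack_alert(task_name: str, reason: str) -> dict:
    """Build a Slack incoming-webhook payload (a JSON object with a "text" field)."""
    return {"text": f"runTask failed for {task_name}: {reason}"}


def alert_if_failed(task_name: str, reason: Optional[str],
                    send: Callable[[dict], None]) -> bool:
    """Send an alert when a failure reason is present; return whether one was sent."""
    if not reason:
        return False
    send(build_slack_alert(task_name, reason))
    return True


# Usage with an in-memory stand-in for the real webhook sender.
sent = []
alerted = alert_if_failed(
    "ndw-etl-prod", "Capacity is unavailable at this time.", sent.append
)
print(alerted, sent)
```

Keeping the sender injectable means the alert logic can be exercised in tests without posting to a real channel.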

Lessons Learned