Date

Authors

James Harris Yafang Deng Jayanta Saha

Status

Summary

The ndw-etl-prod task did not run on Apr 11.

Impact

The User Refresh job on NDW failed as a result.

End users would have seen stale data, 24 hours out of date.

Non-technical Description

The push of TIS data to NDW did not run on Apr 11.


Trigger


Detection


Resolution


Timeline

BST unless otherwise stated


Root Cause(s)


Action Items

Action Items

Comments

Owner

Monitor CloudTrail events. If this error occurs, pick another time to restart the task.

Alert to Slack channel if runTask fails.

Yafang Deng

Change config to use a wider choice of availability zones.
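
The CloudTrail monitoring and Slack alert items above could be sketched roughly as follows. This is a minimal illustration, not the team's actual implementation: it assumes the task is started through the ECS RunTask API, that each CloudTrail record arrives as a dictionary (for example from a hypothetical EventBridge rule feeding a small handler), and that the alert goes to a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names are made up for this example, and the field names follow the usual CloudTrail record layout and should be checked against a real event before use.

```python
import json
from typing import Optional


def run_task_failure_reason(record: dict) -> Optional[str]:
    """Return a failure description for an ECS RunTask CloudTrail record, or None.

    Assumption: `record` is a single CloudTrail event already parsed from JSON.
    """
    if record.get("eventSource") != "ecs.amazonaws.com":
        return None
    if record.get("eventName") != "RunTask":
        return None

    # Case 1: the API call itself was rejected.
    if record.get("errorCode"):
        message = record.get("errorMessage", "no message")
        return f"{record['errorCode']}: {message}"

    # Case 2: the call was accepted but no task could be placed.
    failures = (record.get("responseElements") or {}).get("failures") or []
    if failures:
        return "; ".join(f.get("reason", "unknown reason") for f in failures)

    return None


def slack_payload(record: dict) -> Optional[str]:
    """Build the JSON body for a Slack incoming webhook, or None if no alert is needed."""
    reason = run_task_failure_reason(record)
    if reason is None:
        return None
    when = record.get("eventTime", "unknown time")
    return json.dumps({"text": f"RunTask failed at {when}: {reason}"})
```

Posting the payload to the webhook URL is left out deliberately, since the channel and webhook are not specified here. A caller would send the returned string as the request body whenever `slack_payload` returns something other than `None`.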

Other actions we will not take, at least for now, given the assumed low probability of recurrence:

  • Automated retries

  • A separate job to check that the ETL has run for each environment

Lessons Learned