Date |
|
Authors | |
Status | Working on it |
Summary | |
Impact |
...
- 09:48 AM | |
- 10:25 AM | Created ticket and incident page https://hee-tis.atlassian.net/browse/TISNEW-5728 |
Root Causes
An accidental major version update to one of our core infrastructure tools caused a failure in a dependent tool. Containers that were already running were unaffected, but no new containers could launch, i.e. ETLs or newly deployed software versions.
...
NDW ETL (Prod) failure alert on Slack
Reval / GMC sync ETLs failure alert on Slack
Actions
[insert actions to take to mitigate this happening in future]
e.g.
keep tools more up to date so that future upgrades are smaller and less disruptive
ensure no one person is a single point of failure: require code reviews for infrastructure changes
specific changes to the architecture (list them) to improve resilience:
Use of ‘serverless’ technology: ECS, RDS, DocumentDB
check STAGE matches PROD upgrade
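One possible way to automate the STAGE/PROD check is sketched below. It is an illustration only, not something already in place: it assumes the installed tool versions for each environment can be collected into a simple name-to-version mapping, and the function names and tool names (`tool-a`, `tool-b`) are hypothetical.

```python
def major(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.lstrip("v").split(".")[0])


def major_version_drift(stage: dict[str, str], prod: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return tools present in both environments whose major versions differ.

    The result maps tool name -> (stage version, prod version).
    """
    return {
        tool: (stage[tool], prod[tool])
        for tool in sorted(stage.keys() & prod.keys())
        if major(stage[tool]) != major(prod[tool])
    }


if __name__ == "__main__":
    # Hypothetical inventories; real values would come from each environment.
    stage_versions = {"tool-a": "2.1.0", "tool-b": "1.4.2"}
    prod_versions = {"tool-a": "1.9.3", "tool-b": "1.4.0"}
    for tool, (s, p) in major_version_drift(stage_versions, prod_versions).items():
        print(f"{tool}: STAGE {s} vs PROD {p}")
```

A check like this could run before an upgrade or on a schedule, alerting when an environment has moved to a different major version.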
...