Date	03 Jan 2022
Authors	Liban Hirey
Status	Documenting
Summary	Prod Green server went down due to a scheduled retirement of the instance
Impact	Bulk Upload unavailable

Non-technical Description

The Prod Green server, one of the servers that the TIS application is load-balanced between, became unavailable.
After contacting AWS Support, we learned that the server was shut down due to a scheduled retirement of the instance caused by underlying hardware issues.
We did not receive any emails from AWS informing us of this scheduled retirement, as the address they send emails to (caaa@hee.nhs.uk) is not managed by our team.

...

Trigger

The server instance state was “stopped” when viewed in AWS.

It appeared to be an AMI issue, as the following error was displayed in the instance details:
EC2 can't retrieve the name because the AMI was either deleted or made private

Detection

Slack Alert at 4:02 AM on 03 Jan 2022

...

The server was restarted and began functioning normally. However, the AMI error message is concerning, so we will look at recreating the server with a new AMI.

...

Timeline

03 Jan 2022 ~04:00 - Components started shutting down
03 Jan 2022 04:02 - Alerts triggered on Slack
04 Jan 2022 08:50 - VM restarted in cloud console. Generic Upload available again.
06 Jan 2022 10:35 - Ticket opened with AWS Support
06 Jan 2022 10:55 - Response received from AWS Support
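
The intervals in the timeline above can be worked out with a short sketch like the following. It assumes all timestamps are in the same timezone (the report does not state one), and the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline; treated as naive local times
# because the report does not state a timezone (assumption).
alert_raised = datetime(2022, 1, 3, 4, 2)
vm_restarted = datetime(2022, 1, 4, 8, 50)
ticket_opened = datetime(2022, 1, 6, 10, 35)
support_replied = datetime(2022, 1, 6, 10, 55)


def describe(delta: timedelta) -> str:
    """Format a timedelta as whole hours and minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Time from the first alert until the VM was restarted.
time_to_restore = vm_restarted - alert_raised

# Time AWS Support took to respond once the ticket was opened.
support_response_time = support_replied - ticket_opened

print("Alert to restart:", describe(time_to_restore))
print("Ticket to AWS response:", describe(support_response_time))
```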

...

Root Cause(s)

~~The AMI used by the instance was deleted~~
~~Could this be due to the recent log4j vulnerability?~~
AWS notified us that the instance was stopped due to a scheduled retirement caused by an “unrecoverable issue with the underlying hardware”.

...

Action Items

...

Action Items	Owner
~~Recreate EC2 instance with a new AMI~~
~~Further investigate as a number of our EC2 instances are showing the same AMI deleted/made private message~~
Investigate what triggered the server to go down	https://hee-tis.atlassian.net/browse/TIS21-2532
Mitigate this happening again by making sure we receive emails from AWS	https://hee-tis.atlassian.net/browse/TIS21-2533

...

Lessons Learned

Make sure we receive emails from AWS, instead of them going to an address not managed by our team
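
As a complement to email notifications, scheduled events such as instance retirements can also be read from the EC2 API. The sketch below is one possible way to do that, not something the team has in place: it filters a DescribeInstanceStatus-shaped response for events that have not yet completed. The function names are hypothetical, and the boto3 call assumes credentials and a region are configured.

```python
from typing import Any, Dict, List


def pending_scheduled_events(status_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return scheduled events that have not completed yet.

    Expects the dict shape returned by EC2 DescribeInstanceStatus.
    """
    pending = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            description = event.get("Description", "")
            # AWS prefixes the description of finished events with "[Completed]".
            if description.startswith("[Completed]"):
                continue
            pending.append(
                {
                    "instance_id": status["InstanceId"],
                    "code": event["Code"],  # e.g. "instance-retirement"
                    "description": description,
                    "not_before": event.get("NotBefore"),
                }
            )
    return pending


def fetch_instance_statuses(region_name: str) -> Dict[str, Any]:
    """Fetch the first page of instance statuses (hypothetical helper).

    boto3 is imported lazily so the filter above can be used without it.
    Large accounts would need to follow NextToken for further pages.
    """
    import boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    # IncludeAllInstances=True also reports instances that are not running.
    return ec2.describe_instance_status(IncludeAllInstances=True)
```

Calling `pending_scheduled_events(fetch_instance_statuses(region))` on a schedule, with `region` set to wherever the instances run, and posting any results to the existing Slack alert channel would be one way to surface a retirement before it happens.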
