Date | |
Authors | |
Status | Documenting |
Summary | Prod Green server went down due to a scheduled retirement of the instance |
Impact | Bulk Upload unavailable |
...
One of the servers that the TIS application is load-balanced between became unavailable.
AWS Support later confirmed that the server was shut down as part of a scheduled retirement of the instance, caused by underlying hardware issues.
We did not receive any emails from AWS about this scheduled retirement because the address they are sent to (caaa@hee.nhs.uk) is not managed by our team.
...
Trigger
The server instance state was “stopped” when viewed in AWS.
This appeared to be an Amazon Machine Image (AMI) issue, as the following error is displayed in the instance details:
EC2 can't retrieve the name because the AMI was either deleted or made private
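To see how many instances show the same AMI message, the check can be scripted rather than done by hand in the console. The following is a minimal sketch, not the team's actual tooling: it assumes boto3 is installed and credentials are configured, and the function names `instances_with_missing_ami` and `check_region` are hypothetical. It compares the AMI IDs referenced by instances against the AMIs that `describe_images` can still return.

```python
from typing import Any, Dict


def instances_with_missing_ami(
    reservations_response: Dict[str, Any],
    images_response: Dict[str, Any],
) -> Dict[str, Any]:
    """Map instance ID -> AMI ID for instances whose AMI is not visible.

    Both arguments are dicts shaped like the EC2 describe_instances and
    describe_images responses.
    """
    visible = {img["ImageId"] for img in images_response.get("Images", [])}
    missing = {}
    for reservation in reservations_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            ami_id = instance.get("ImageId")
            if ami_id not in visible:
                missing[instance["InstanceId"]] = ami_id
    return missing


def check_region(region_name: str) -> Dict[str, Any]:
    """Hypothetical helper: run the check against a live AWS region."""
    import boto3  # imported lazily so the pure function above needs no AWS SDK

    ec2 = boto3.client("ec2", region_name=region_name)
    reservations = {"Reservations": []}
    for page in ec2.get_paginator("describe_instances").paginate():
        reservations["Reservations"].extend(page["Reservations"])

    ami_ids = sorted(
        {
            inst["ImageId"]
            for res in reservations["Reservations"]
            for inst in res["Instances"]
        }
    )
    # Filtering by image-id returns only the AMIs that still exist and are
    # visible to this account, without raising for the ones that are gone.
    images = (
        ec2.describe_images(Filters=[{"Name": "image-id", "Values": ami_ids}])
        if ami_ids
        else {"Images": []}
    )
    return instances_with_missing_ami(reservations, images)
```

Calling `check_region` with the region the instances run in would return a dictionary of affected instance IDs and the AMI IDs they reference. For a large number of distinct AMIs the filter values may need to be sent in batches.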
Detection
Slack Alert at 4:02 AM on
...
The server was restarted and began functioning normally. However, the AMI error message is concerning, so we will look at recreating the server with a new AMI.
...
Timeline
~04:00 - Components started shutting down
04:02 - Alerts triggered on Slack
08:50 - VM restarted in cloud console. Generic Upload available again.
10:35 - Ticket opened with AWS Support
10:55 - Response received from AWS Support
...
Root Cause(s)
The AMI used by the instance was deleted.
Could this be due to the recent Log4j vulnerability?
AWS notified us that the instance was stopped due to a scheduled retirement caused by an “unrecoverable issue with the underlying hardware”.
...
Action Items
Action Items | Owner |
---|---|
Investigate what triggered the server to go down | https://hee-tis.atlassian.net/browse/TIS21-2513 |
Further investigate, as a number of our EC2 instances are showing the same AMI deleted/made private message | 2532 |
Mitigate this happening again by making sure we receive emails from AWS | https://hee-tis.atlassian.net/browse/TIS21-2513 |
Investigate what triggered this change on the AMI so as to mitigate it reoccurring | 2533 |
...
Lessons Learned
Make sure we receive emails from AWS, instead of them going to an address not managed by our team.
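As a complement to fixing the email address, scheduled retirements can also be read from the EC2 API, where they appear as scheduled events with the code `instance-retirement`. The sketch below is one possible approach under stated assumptions, not an agreed action: it assumes boto3 and configured credentials, and the function names `find_scheduled_retirements` and `fetch_scheduled_retirements` are hypothetical. Its output could be posted to the same Slack channel that raised the alert.

```python
from typing import Any, Dict, List

RETIREMENT_CODE = "instance-retirement"


def find_scheduled_retirements(status_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract retirement events from a describe_instance_status response."""
    found = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            if event.get("Code") == RETIREMENT_CODE:
                found.append(
                    {
                        "instance_id": status.get("InstanceId"),
                        "not_before": event.get("NotBefore"),
                        "description": event.get("Description"),
                    }
                )
    return found


def fetch_scheduled_retirements(region_name: str) -> List[Dict[str, Any]]:
    """Hypothetical helper: query a live AWS region for retirement events."""
    import boto3  # imported lazily so the parser above can be used without it

    ec2 = boto3.client("ec2", region_name=region_name)
    paginator = ec2.get_paginator("describe_instance_status")
    results = []
    # IncludeAllInstances=True also reports instances that are not running.
    for page in paginator.paginate(IncludeAllInstances=True):
        results.extend(find_scheduled_retirements(page))
    return results
```

Run on a schedule, `fetch_scheduled_retirements` would list each instance with a pending retirement and the earliest time AWS may act on it, giving a warning that does not depend on who receives the account emails.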