Date

Authors

Liban Hirey (Unlicensed)

Status

In progress / Documenting

Summary

Prod Green server went down due to a scheduled retirement of the instance

Impact

Bulk Upload unavailable

Non-technical Description

  • One of the servers that the TIS application is load-balanced between became unavailable.

  • After contacting AWS Support, we learned that the server was shut down because of a scheduled retirement of the instance, caused by underlying hardware issues.

  • We did not receive any emails from AWS informing us of this scheduled retirement, because the address AWS sends them to (caaa@hee.nhs.uk) is not managed by our team.

...

Trigger

  • The server instance state was “stopped” when viewed in AWS

Detection

  • Slack Alert at 4:02 AM on

...

Resolution

  • The server was restarted and resumed functioning normally

...

Timeline

  • ~04:00 - Components started shutting down

  • 04:02 - Alerts triggered on Slack

  • 08:50 - VM restarted in cloud console. Generic Upload available again.

  • 10:35 - Ticket opened with AWS Support

  • 10:55 - Response received from AWS Support

...

Root Cause(s)

  • The AMI used by the instance was deleted

  • Could this be due to the recent log4j vulnerability?

  • AWS notified us that the instance was stopped due to a scheduled retirement caused by an “unrecoverable issue with the underlying hardware”.

...

Action Items

...

Action Items

Owner

Lessons Learned

Recreate EC2 instance with a new AMI

Investigate further, as multiple EC2 instances are showing the same “AMI deleted/made private” message

Investigate what triggered the server to go down

https://hee-tis.atlassian.net/browse/TIS21-2532

Prevent this from happening again by making sure we receive emails from AWS

https://hee-tis.atlassian.net/browse/TIS21-2533

...

Lessons Learned

  • Make sure we receive emails from AWS instead of them going to an account not managed by our team
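
As a possible complement to fixing the email address, scheduled retirements can also be read directly from the EC2 API. The sketch below is illustrative only and is not something the team has in place: it assumes a response shaped like the one returned by boto3's `describe_instance_status`, and the function names `find_disruptive_events` and `fetch_instance_status` as well as the sample instance ID are hypothetical.

```python
# Event codes that mean AWS will stop or retire the instance.
DISRUPTIVE_CODES = {"instance-retirement", "instance-stop"}


def find_disruptive_events(response):
    """Return (instance_id, code, not_before) for pending disruptive events.

    `response` is assumed to have the shape of an EC2
    DescribeInstanceStatus response.
    """
    found = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            # Finished events stay visible for a while; AWS prefixes
            # their description with "[Completed]".
            if event.get("Description", "").startswith("[Completed]"):
                continue
            if event.get("Code") in DISRUPTIVE_CODES:
                found.append(
                    (status["InstanceId"], event["Code"], event.get("NotBefore"))
                )
    return found


def fetch_instance_status():
    """Query EC2 once (requires boto3 and AWS credentials).

    This is a single call; a real job would also follow NextToken
    to page through all instances.
    """
    import boto3  # imported lazily so the parsing above works without boto3

    ec2 = boto3.client("ec2")
    return ec2.describe_instance_status(IncludeAllInstances=True)


if __name__ == "__main__":
    # Made-up sample data, not taken from the incident.
    sample = {
        "InstanceStatuses": [
            {
                "InstanceId": "i-0123456789abcdef0",
                "Events": [
                    {
                        "Code": "instance-retirement",
                        "Description": "The instance is running on degraded hardware",
                    }
                ],
            }
        ]
    }
    for instance_id, code, not_before in find_disruptive_events(sample):
        print(instance_id, code, not_before)
```

A check like this could run on a schedule and post to the existing Slack alert channel, so that a retirement notice does not depend on a single mailbox.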