2023-05-19 TIS unavailable for Northern Ireland

Date

May 19, 2023

Authors

@John Simmons (Deactivated) @Joseph (Pepe) Kelly

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-4563

A configuration defect meant that, when a machine was restarted, the web address for Northern Ireland no longer pointed to TIS.

Impact

Northern Ireland users were unable to access TIS until the configuration was corrected.

Non-technical Description

NIMDTA TIS became unavailable after its servers were downsized to save costs. Although the new, smaller servers started and responded to our internal testing, external access failed.


Trigger

  • Resizing the NIMDTA apps and database servers to save money


Detection

  • User report (Slack)


Resolution

  • Added the new server's public IP address to DNS so that the service could be used again as quickly as possible.

  • Added a permanent IP address for the NIMDTA web server so that any further stopping and starting of the server will result in the same IP address being used each time.


Timeline

BST unless otherwise stated

  • May 18, 2023 - Machines resized and restarted. Once they became available, an SSH login confirmed that both servers were accessible. (This procedure only tested the private addresses, not the public address.)

  • May 19, 2023 - 9:52 am Mark Oliver messages on Slack to say there is no connection to TIS

  • May 19, 2023 - 10:05 am Problem identified and correction started

  • May 19, 2023 - 10:15 am Correction applied

  • May 19, 2023 - 10:20 am DNS changes took effect after the 600-second window

  • May 19, 2023 - 10:21 am Service restored and Mark was asked to test connections

  • May 19, 2023 - 11:16 am Mark Oliver confirms all is working as expected

  • May 19, 2023 - 11:45 am Elastic IP address assigned to the NIMDTA apps server, and DNS updated, to stop this happening again

 


Root Cause(s)

  • Trying to reach the Admins UI took too long and resulted in an error.

  • The website address referred to an IP address which was not reachable.

  • We should also have received an alert from UptimeRobot saying that the NIMDTA service was not available. This would have alerted us to the problem before the end users found it, but all of our external monitoring had been removed from UptimeRobot without our knowledge.

  • The web server no longer had that IP address assigned to it.

  • When the server was initially built, an elastic IP address was not assigned to it. Reboots would therefore probably have kept the original public IP address, but a full stop and then start of the server would definitely have resulted in a new public IP address being assigned to the VM.
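
The mismatch described above (DNS still naming the old public IP address after the server received a new one) can be checked for directly. The following is a minimal sketch, not the team's actual tooling: the function names are hypothetical, and the hostname and IP address shown in the usage comment are placeholders rather than the real NIMDTA values.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses that DNS currently gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_points_at(hostname, expected_ip, resolver=resolve_ipv4):
    """True if DNS for hostname includes the server's current public IP.

    `resolver` is injectable so the comparison can be exercised without
    touching real DNS.
    """
    return expected_ip in resolver(hostname)


# Hypothetical usage (placeholder hostname and documentation-range IP):
#   dns_points_at("tis.example.org", "203.0.113.10")
```

Running a comparison like this after any stop/start of a server would flag a stale DNS record even when the server itself looks healthy from inside the network.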


Action Items

Action Items

Comments

Owner

Add elastic IP Address Creation/Assignment to Terraform config for servers that need public IP addresses.

 

@John Simmons (Deactivated)

Add external monitoring of public facing websites

 

@John Simmons (Deactivated)
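
The first action item calls for Terraform configuration. Purely as an illustration of the same steps, here is a sketch in Python instead, assuming the servers run on AWS EC2 and that `ec2_client` is a boto3 EC2 client (or anything with the same three methods). The function name `ensure_elastic_ip` and the instance ID in the usage comment are hypothetical.

```python
def ensure_elastic_ip(ec2_client, instance_id):
    """Give the instance a fixed public IP address and return it.

    Assumes ec2_client behaves like boto3.client("ec2"). If an Elastic IP
    is already associated with the instance it is reused, so the function
    is safe to run more than once.
    """
    existing = ec2_client.describe_addresses(
        Filters=[{"Name": "instance-id", "Values": [instance_id]}]
    )["Addresses"]
    if existing:
        return existing[0]["PublicIp"]

    # Allocate a new address, then bind it to the instance so that a
    # stop/start no longer changes the public IP address.
    allocation = ec2_client.allocate_address(Domain="vpc")
    ec2_client.associate_address(
        InstanceId=instance_id,
        AllocationId=allocation["AllocationId"],
    )
    return allocation["PublicIp"]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   public_ip = ensure_elastic_ip(boto3.client("ec2"), "i-0123456789abcdef0")
```

Keeping this in Terraform, as the action item proposes, has the advantage that the address is created and attached whenever the server is built, rather than relying on a separate step.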

Lessons Learned

  • Do not rely on the private IP address alone to confirm that a server is back up after a restart. That only shows the server is reachable internally, not that users can reach it on its public address.
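
A post-restart check along these lines could test both addresses. This is a minimal sketch under stated assumptions: the function names are hypothetical, a successful TCP connection is used as a rough stand-in for "the service is up", and the addresses in the usage comment are placeholders.

```python
import socket


def is_reachable(host, port=443, timeout=5.0):
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def post_restart_check(private_host, public_host, port=443, timeout=5.0):
    """Check the private and the public address and report on each."""
    return {
        "private": is_reachable(private_host, port, timeout),
        "public": is_reachable(public_host, port, timeout),
    }


# Hypothetical usage after resizing a server (placeholder addresses):
#   result = post_restart_check("10.0.0.12", "tis.example.org")
#   if not all(result.values()):
#       print("Restart not verified:", result)
```

To be meaningful, the public check needs to run from outside the private network, which is also what an external monitoring service provides.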