Date | |
Authors | |
Status | Done |
Summary | A configuration defect meant that when a machine was restarted, the address for the Northern Ireland service no longer pointed to TIS. |
Impact | Northern Ireland users were unable to access TIS until the configuration was corrected. |
Non-technical Description
NIMDTA TIS became unavailable after its servers were downsized to save costs. Although the new, smaller servers started and responded to our internal testing, external access failed.
...
Trigger
Resizing the NIMDTA apps and database servers to save money
...
Detection
User report (Slack)
...
Resolution
Added the new server's public IP address to DNS for Northern Ireland's TIS so that the service could be used again as quickly as possible
Added a permanent IP address for the NIMDTA web server so that any further stopping and starting of the server will result in the same IP address being used each time.
...
Timeline
BST unless otherwise stated
19 - Machines resized and restarted. Once they became available, an SSH login to both servers was performed and succeeded. (This procedure tested only the private addresses, not the public address.)
- 9:52 am Mark Oliver messages on Slack to say there is no connection to TIS
- 10:05 am Problem identified and correction started
- 10:15 am Correction applied
- 10:20 am DNS changes took effect after the 600-second window.
- 10:21 am Service restored and Mark was asked to test connections
- 11:16 am Mark Oliver confirms all is working as expected
- 11:45 am Elastic IP address assigned to NIMDTA apps server, and DNS updated to stop this happening again.
...
Root Cause(s)
Trying to reach the Admins UI took too long and resulted in an error.
The website address referred to an IP address which was not reachable.
We should also have had an alert from UptimeRobot to say that the NIMDTA service was not available. This would have alerted us to the problem before the end users found it, but all of our external monitoring had been removed from UptimeRobot without our knowledge.
The web server did not have the IP address assigned to it.
When the server was initially built, an elastic IP address was not assigned to it. Reboots would therefore probably have kept the original public IP address, but a full stop and then start of the server would definitely have resulted in a new public IP address being assigned to the VM.
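The mismatch described above (DNS still naming an address the server no longer holds) can be detected with a small comparison. The following is a minimal sketch in Python, not the team's actual tooling: the function names, the hostname `tis.example.org`, and the `203.0.113.x` addresses are all hypothetical, and the server's current public IP is assumed to be obtained separately (for example from the cloud console).

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses that DNS currently gives for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return set(addresses)


def dns_matches_server(hostname, current_public_ip, resolver=resolve_ipv4):
    """Return True if the hostname's DNS record includes the server's current public IP.

    `resolver` is injectable so the comparison can be exercised without a
    real DNS lookup; by default it uses the system resolver.
    """
    return current_public_ip in resolver(hostname)


if __name__ == "__main__":
    # Hypothetical values: replace with the real public hostname and the
    # public IP the server reports after a stop/start.
    print(dns_matches_server("tis.example.org", "203.0.113.10"))
```

Run after a stop/start, a `False` result would suggest the DNS record still refers to the old public address.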
...
Action Items
Action Items | Comments | Owner |
---|---|---|
Add elastic IP address creation/assignment to the Terraform config for servers that need public IP addresses. | | |
Add external monitoring of public-facing websites. | | |
Lessons Learned
Do not check only the private IP address to see whether a server is back up after a restart. Also check the public address, because the private check says nothing about external access.
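One way to act on this lesson is to request the public URL from outside the private network after every restart. The sketch below uses only the Python standard library and is an illustration under assumptions, not an existing script: the function name `public_endpoint_ok` and the URL `https://tis.example.org/` are hypothetical placeholders.

```python
import urllib.error
import urllib.request


def public_endpoint_ok(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the public URL answers with an HTTP status below 400.

    `opener` is injectable so the check can be tested without network access.
    """
    try:
        response = opener(url, timeout=timeout)
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, timeouts, refused connections and HTTP errors.
        return False
    try:
        return response.status < 400
    finally:
        response.close()


if __name__ == "__main__":
    # Hypothetical public address; substitute the real user-facing URL.
    print(public_endpoint_ok("https://tis.example.org/"))
```

The check goes through DNS and the public IP address, so it would fail in the situation described in this report even though an SSH login over the private address succeeds.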