Post Migration - A Postmortem
The night of the 23rd of September was one of heavy rain, whistling winds and roaring thunder: such was the might of the migration. Little did Azure know that this night was its final night.
After three hours of brutal keyboard bashing, the monster finally cowered and gave way to the future - AWS.
Joking aside, we finished the migration to AWS on the 23rd of September, out of hours so that no one should have been affected. We put up a holding/maintenance page and stopped TIS, so that there was no way for users to change data while the work was being carried out.
Data was exported from Azure and imported into AWS. Once that was done, TIS was brought back up in AWS and the domains were updated to point at the new environment. Some light tests were carried out and all looked good.
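As a rough illustration of that final step: if the DNS records were managed in Route 53, repointing a domain at the new environment would look something like the sketch below. The zone ID, record name and target are all hypothetical placeholders, not our real values.

```python
import boto3

route53 = boto3.client("route53")

# UPSERT either creates the record or overwrites the old Azure-facing one.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone ID
    ChangeBatch={
        "Comment": "Point TIS at the new AWS environment",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "tis.example.nhs.uk.",  # placeholder domain
                "Type": "CNAME",
                "TTL": 300,  # short TTL so the switch propagates quickly
                "ResourceRecords": [
                    {"Value": "tis-lb.eu-west-2.elb.amazonaws.com"}  # placeholder target
                ],
            },
        }],
    },
)
```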
Success!
Snags
The following day, we discovered two issues. The first was with the new-world ESR database; this wasn't much of an issue as it's not yet in use and it is being cleared down. The second, which had a larger impact, was with the old connections discrepancy app.
The issue is with the way the system accesses the GMC: we've moved platforms, so our public IP address is different. We've already contacted the GMC to register the new address on their side, so the fix should be in place soon.
Lessons Learnt
This is where it would be good to get some feedback (sprint review). The approach taken was to take the site down for maintenance. Did this have any business impact, given it was done during the late hours? Did the issues with reval cause too much pain?
Looking back, there may have been other ways to mitigate the downtime (something we thought about a lot during the project); reaching out to other development communities might have turned up better alternatives.
We also didn't let users know the date the switchover was going to happen. This was mainly because some tickets in the sprint had to be completed beforehand, and there was no certainty they would be done by a particular time.
Future/What's next
One of the issues with this type of migration is that although we're moving to a better platform, any underlying issues with TIS will also remain. There are many things that need to be done to correct these, and we will have to weigh the cost of fixing them within the existing architecture against completely rewriting parts of TIS so that they integrate better with AWS.
Where we are now
Below is a very high-level overview of what we now have in AWS. It doesn't show everything, just how some things are laid out.
From the diagram above, you will notice that all the servers are placed in a single AWS region. A region is, roughly, a geographic cluster of data centres. The issue with this is that if something affects that region, such as a network connectivity problem, then the whole of TIS won't be accessible.
Another issue is that all of the servers sit in a public subnet. This means each server has a public IP and can be reached directly from the web (which does not mean they are fully exposed), but without the correct security configuration these servers could potentially be tampered with.
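As an illustration of the kind of security configuration that matters while the servers carry public IPs, the sketch below uses boto3 to swap a wide-open ingress rule for an HTTPS-only one. The security group ID and the rules themselves are hypothetical.

```python
import boto3

ec2 = boto3.client("ec2")

SG_ID = "sg-0123456789abcdef0"  # hypothetical security group ID

# Remove the allow-everything rule...
ec2.revoke_security_group_ingress(
    GroupId=SG_ID,
    IpPermissions=[{
        "IpProtocol": "-1",  # all protocols and ports
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)

# ...and allow only HTTPS in from the internet.
ec2.authorize_security_group_ingress(
    GroupId=SG_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTPS only"}],
    }],
)
```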
Below are a few options for what we could do from here on.
Traditional Architecture
Traditional network architecture typically exposes only the web servers that serve the website assets, with everything else in a network that can only be accessed from within AWS (no direct access from the web). The web servers here would run Apache and would only forward approved traffic to the application servers that host the TIS services.
The changes required here would be to create new private subnets and new application servers, and deploy the services on them; move the other machines into the private subnets; and possibly shrink the web servers, as they don't need to be highly specced machines if they just forward requests (a sketch of the subnet change follows the cons below).
Pros
Apps are more secure
Not too much work from where we are now
Not too different from the existing infra, so team members should be able to grasp this easily
Cons
Need more machines
Same level of reliability
A single database server is a major point of failure, and also causes monitoring false positives when running heavy queries
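For a flavour of the subnet change mentioned above, here is a minimal boto3 sketch that creates a subnet and stops instances launched into it from being handed public IPs. The VPC ID, CIDR block and availability zone are hypothetical, and note that what really makes a subnet private is a route table with no internet gateway route; this sketch only covers the public-IP side.

```python
import boto3

ec2 = boto3.client("ec2")

# Carve out a subnet for the application servers.
subnet = ec2.create_subnet(
    VpcId="vpc-0123456789abcdef0",  # hypothetical VPC ID
    CidrBlock="10.0.2.0/24",        # placeholder address range
    AvailabilityZone="eu-west-2a",  # placeholder AZ
)

# Ensure instances launched here never get a public IP by default.
ec2.modify_subnet_attribute(
    SubnetId=subnet["Subnet"]["SubnetId"],
    MapPublicIpOnLaunch={"Value": False},
)
```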
Clustered Architecture
One step beyond the previous diagram, we could choose to cluster the servers that can be clustered across regions. This would be native clustering, where you cluster the application servers (JBoss, GlassFish and the like), the database servers and the messaging servers. It would provide some resilience if a region became unavailable (a sketch of what initiating such a cluster looks like follows the cons below).
Pros
More reliable
Most of the work has already been done for rabbit, mongo and es
Cons
No clustering of the DB is done atm
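To make the clustering idea concrete, here is a minimal sketch of initiating a MongoDB replica set with pymongo. The hostnames are hypothetical (think one member per region or availability zone), and each mongod would need to be started with the same --replSet name for this to work.

```python
from pymongo import MongoClient

# Hypothetical members, one per region/availability zone.
config = {
    "_id": "tisReplicaSet",  # must match the --replSet name each mongod runs with
    "members": [
        {"_id": 0, "host": "mongo-a.internal:27017"},
        {"_id": 1, "host": "mongo-b.internal:27017"},
        {"_id": 2, "host": "mongo-c.internal:27017"},
    ],
}

# Connect directly to one member and initiate the set; the members then
# elect a primary and replicate between themselves.
client = MongoClient("mongo-a.internal", 27017, directConnection=True)
client.admin.command("replSetInitiate", config)
```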
Docker Swarm
Going even further, we could create a Docker Swarm cluster. This is a Docker-native approach that abstracts the notion of servers away from developers and replaces it with resources: you no longer think about the underlying servers, only how much resource is left/in use.
This approach would be highly available and resource efficient, as each node accepts traffic for any service in the swarm and automatically forwards it to the node running that service. Services are also deployed to whichever nodes have resource available, without any input from developers, making efficient use of the underlying machines. Nodes can be added and removed dynamically, so any changes in requirements can be met (a minimal sketch of setting up a swarm follows the cons below).
Pros
Efficient use of resources - could reduce the monthly bill
Docker native, ensuring alignment in all services and deployments
Uses docker compose files to deploy services, so all TIS services “SHOULD” be able to deploy to a swarm cluster with minimal changes
Cons
Docker Swarm is no longer actively developed and is apparently in maintenance mode (owing to the popularity of Kubernetes)
Something new for developers to learn
Current ways of debugging won't work - we'll need to make better use of observability tools
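For a feel of what swarm involves, below is a minimal sketch using the Docker SDK for Python. The advertise address is a placeholder, nginx stands in for a real TIS service image, and in practice the worker nodes would also join the swarm using a token returned at init time.

```python
import docker

client = docker.from_env()

# Turn this node into a swarm manager (address is a placeholder).
client.swarm.init(advertise_addr="10.0.1.10")

# Deploy a replicated service. Swarm decides which nodes run the replicas,
# and every node routes incoming traffic on port 80 to one of them.
client.services.create(
    image="nginx:latest",  # stand-in for a TIS service image
    name="tis-web",
    mode=docker.types.ServiceMode("replicated", replicas=3),
    endpoint_spec=docker.types.EndpointSpec(ports={80: 80}),
)
```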
Modern with managed services
AWS, along with the other cloud providers, has come a long way in the last five years and has changed the way applications are developed and deployed.
One thing they now provide is products that reduce the management overhead around resources. This means AWS could give developers a service that not only hosts applications but also manages the underlying resources.
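As a sketch of what that could look like for us, a managed database through RDS would take patching, backups and failover off our hands (MySQL here purely as an example; every identifier and size below is a placeholder, not a recommendation):

```python
import boto3

rds = boto3.client("rds")

# A managed, multi-AZ MySQL instance: AWS handles patching, backups and
# failover - the management overhead we currently carry ourselves.
rds.create_db_instance(
    DBInstanceIdentifier="tis-db",   # placeholder name
    Engine="mysql",
    DBInstanceClass="db.t3.medium",  # placeholder size
    AllocatedStorage=50,             # GiB, placeholder
    MasterUsername="admin",
    MasterUserPassword="change-me",  # placeholder; keep real creds in a secret store
    MultiAZ=True,                    # standby replica in another availability zone
)
```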
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213