Key Results: Benchmarking AWS and Azure
One of the key results for this quarter's OKR for the AWS migration is to perform a stress test against AWS. This page details that test, along with other benchmarks aimed at different users.
Stress Testing
The following tables show HTTP result counts for both Azure and AWS, measured on one of the larger TIS components (TCS) against one of its slower endpoints (get post by ID with placements).
Azure:
AWS:
Insights
Here, we can see that AWS has a much lower error response rate under load. This could be because of the way the load is spread between servers.
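To compare the two sets of HTTP result counts directly, each one can be reduced to a single error rate. The following is a minimal sketch under some assumptions: the function name `error_rate` is hypothetical, any 5xx status is treated as an error, and the counts shown are placeholders rather than the measured results.

```python
from collections import Counter


def error_rate(status_counts):
    """Return the fraction of responses that were server errors (5xx)."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return errors / total


# Placeholder counts for illustration only, NOT the measured results.
example_counts = Counter({200: 900, 502: 60, 504: 40})
print(f"error rate: {error_rate(example_counts):.1%}")
```

Running the same calculation on the Azure and AWS counts would give two percentages that can be compared side by side.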
Response Times
The following is a graph of response times, in milliseconds, from the same endpoint with 30 concurrent users accessing the same data.
Insights
Again, AWS responds to requests faster than Azure, with worst-case response times 110 milliseconds faster than on Azure.
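A measurement like this could be reproduced with a small script. The sketch below is illustrative only and is not the tool used for the graph above: `measure`, `timed_call`, and `fake_request` are hypothetical names, and `fake_request` stands in for the real HTTP call so the example runs without network access. It keeps up to 30 requests in flight at once and reports the median and worst-case times.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(request_fn):
    """Run request_fn once and return its duration in milliseconds."""
    start = time.perf_counter()
    request_fn()
    return (time.perf_counter() - start) * 1000.0


def measure(request_fn, concurrent_users=30, requests_per_user=5):
    """Keep up to concurrent_users requests in flight and summarise timings."""
    total = concurrent_users * requests_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        timings = list(pool.map(lambda _: timed_call(request_fn), range(total)))
    return {
        "count": len(timings),
        "median_ms": statistics.median(timings),
        "worst_ms": max(timings),
    }


def fake_request():
    # Stand-in for the real HTTP call; sleeps 10 ms instead of hitting TIS.
    time.sleep(0.01)


if __name__ == "__main__":
    print(measure(fake_request))
```

Replacing `fake_request` with a function that calls the endpoint on each platform would give comparable median and worst-case figures.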
More Response times
Ultimately, end users are the main customers of TIS, so we need to show that the move has had no detrimental effect on their day-to-day work.
The following are some response times from the browser, as the user will see them while working on TIS:
Azure TIS after login:
AWS TIS after login:
Azure TIS view person (person id 28):
AWS TIS view person (person id 28):
Azure TIS view post:
AWS TIS view post:
Insights
From the screen grabs above, we learn that the browser spends the majority of its time running code from TIS, but it's clear that the idle time (time spent waiting for things like TIS responses) is greatly reduced in AWS. This could be due to any number of things (better hardware, a closer location, etc.), but ultimately it shows that users are spending less time waiting.
Reliability
Below is a demo of the reliability checks in action. In this video, we run TIS on two servers. Once a server has been disabled, health checks detect it and stop routing traffic to that server, allowing users to continue to access TIS. It does take a while to kick in, but it allows users to continue with their day-to-day work and gives IT time to fix any issues.
https://www.loom.com/share/7a210877bd4242c3ac56cfa14b6c29f2
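The delay before traffic is rerouted is typical of health checks that wait for several consecutive failures before marking a server unhealthy. The sketch below illustrates that idea only. It is not the AWS implementation, and the class name `HealthTracker`, the server names, and the threshold of three are all assumptions made for illustration.

```python
class HealthTracker:
    """Track consecutive failed health checks per server."""

    def __init__(self, servers, unhealthy_threshold=3):
        self.unhealthy_threshold = unhealthy_threshold
        self.failures = {server: 0 for server in servers}

    def record(self, server, check_passed):
        # A passing check resets the count; a failing one increments it.
        self.failures[server] = 0 if check_passed else self.failures[server] + 1

    def routable(self):
        """Servers that should still receive traffic."""
        return [s for s, n in self.failures.items() if n < self.unhealthy_threshold]


tracker = HealthTracker(["server-a", "server-b"])
for _ in range(3):  # server-b fails three checks in a row
    tracker.record("server-a", True)
    tracker.record("server-b", False)
print(tracker.routable())  # ['server-a']
```

In this simplified model, a server is only removed after the threshold is reached, so the detection time is roughly the check interval multiplied by the threshold. Lowering either value would shorten the delay at the cost of reacting to brief blips.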
Insights
There's more work to be done here, as alerting could still be configured, but since AWS provides this feature with minimal effort, we already have something better than Azure.
Build times
One way to improve the development experience is fast turnaround (feedback) from external systems. Typically, while developing a feature, a developer pushes code to a central repository regularly. This code is a possible release candidate and therefore needs to go through a pipeline of quality checks. This pipeline can take some time to complete, and you don't want the developer waiting around for feedback.
Below are several screen captures of the same pipelines, for the same code and features, showing how long each takes to complete.
Admins UI Azure:
Admins UI AWS:
TCS Azure:
TCS AWS:
ESR Inbound reader Azure:
ESR Inbound reader AWS:
Insights:
Comparing the pipelines in AWS and Azure, it's easy to see that some stages are up to roughly 1 minute faster on AWS. If multiple developers push multiple times a day, the compound savings could potentially be substantial.
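The compound effect can be estimated with simple arithmetic. The sketch below is a rough illustration: the function name `daily_minutes_saved`, the team size, and the number of pushes per day are hypothetical, and only the figure of about 1 minute saved per run comes from the comparison above.

```python
def daily_minutes_saved(developers, pushes_per_day, minutes_saved_per_run):
    """Total pipeline minutes saved per day across the whole team."""
    return developers * pushes_per_day * minutes_saved_per_run


# Hypothetical team: 10 developers, each pushing 5 times a day,
# with roughly 1 minute saved per pipeline run.
saved = daily_minutes_saved(10, 5, 1.0)
print(f"{saved:.0f} minutes/day, about {saved * 5 / 60:.1f} hours per 5-day week")
```

With these placeholder figures the saving is 50 minutes a day; real numbers for team size and push frequency would need to be substituted to get a meaningful estimate.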
Cost
For other stakeholders (management and C-level staff), cost can be a defining factor in choosing a cloud provider.
Currently in Azure, we have an inventory of virtual machines and registries using storage space. The average monthly cost to run TIS in Azure is unknown because we have been denied access to that information. AWS, on the other hand, provides an estimate.
The issue with this estimate is that we currently have a lot of experiments and resources in use for the migration and other projects (TIS SS and Reval).
Insights:
It's not currently possible to do an apples-to-apples comparison of the cost of running TIS in Azure versus AWS. Also, at the moment, we have done little to optimise cost and resource usage.
It's probably better to come back to this point at a later date.
Related pages
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213