One of the key results in this quarter's OKR for the AWS migration is to perform a stress test against AWS. The following details this, as well as other benchmarks targeting different users.

...

The following is a table of HTTP result counts against both Azure and AWS on one of the larger TIS components (TCS), using one of the slower endpoints (get post by id with placements).

Azure:

...

AWS:

...

Insights

Here, we can see that AWS has a much lower error response rate under load.

{table}

Insights:

This could be because of the way the load is spread between the servers.
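As a rough illustration of how an error response rate can be derived from HTTP result counts, the sketch below tallies status codes and reports the share that are errors. The function name `error_rate` and the choice to treat every 4xx and 5xx response as an error are assumptions, and the sample codes are placeholders, not the measured results.

```python
from collections import Counter


def error_rate(status_codes):
    """Return (counts per status code, fraction of responses that are 4xx/5xx)."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    if total == 0:
        return counts, 0.0
    # Assumption: any status code of 400 or above counts as an error response.
    errors = sum(n for code, n in counts.items() if code >= 400)
    return counts, errors / total


# Placeholder status codes for illustration only; these are NOT measured results.
sample = [200, 200, 200, 502, 200, 504, 200, 200]
counts, rate = error_rate(sample)
print(dict(counts), f"error rate: {rate:.1%}")
```

Running the same tally over the Azure and AWS result sets would give two directly comparable percentages.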

Response Times

The following is a graph of response times in milliseconds from the same endpoint with 30 concurrent users accessing the same data

...

Insights

Again, AWS is responding to requests faster than Azure, with worst-case scenarios being 110 milliseconds faster than Azure.

More Response times

Ultimately, the end users will be the main customers of the TIS system, so we need to show that the move has not had any detrimental effect on their day-to-day work.

The following are some response times from the browser as the user will see them while working on TIS

Azure TIS after login:

...

AWS TIS after login:

...

Azure TIS view person (person id 28):

...

AWS TIS view person (person id 28):

...

Azure TIS view post:

...

AWS TIS view post:

The following is a graph of response times from the same endpoint with indications of percentiles

{graph}
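For readers who want to reproduce percentile markers from raw timings, here is a minimal sketch using Python's standard `statistics.quantiles`. The helper name `latency_percentiles`, the chosen percentile points, and the sample timings are all assumptions made for illustration; they are not taken from the graph.

```python
import statistics


def latency_percentiles(samples_ms, points=(50, 90, 95, 99)):
    """Return {percentile: value in ms} for the requested percentile points."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


# Hypothetical timings in milliseconds, for illustration only.
timings = [120, 135, 128, 410, 131, 126, 140, 133, 129, 380]
print(latency_percentiles(timings))
```

Higher percentiles (95th, 99th) describe the slowest requests, which is where worst-case differences between providers tend to show up.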

Insights:

...

Insights

From the screen grabs above, we learn that the browser spends the majority of the time running code from TIS, but it's clear that the idle time (time spent waiting for things like TIS responding) is greatly reduced in AWS. This could be because of any number of things (better hardware, closer location, etc.), but ultimately it shows that users are spending less time waiting.

Reliability

Below is a demo of the reliability checks in play {video}. In this video, we run TIS on 2 servers. Once a server has been disabled, health checks detect it and stop routing traffic to that server, allowing users to continue to access TIS. It does take a while to kick in, but it allows users to continue with their day-to-day work and gives IT time to fix any issues.

https://www.loom.com/share/7a210877bd4242c3ac56cfa14b6c29f2
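The behaviour in the video can be modelled with a small sketch: a pool that stops routing to a server once it has failed a number of consecutive health checks. This is a simplified model and not how the AWS load balancer is implemented. The class name `HealthCheckedPool`, the server names, and the threshold of 3 are all hypothetical.

```python
class HealthCheckedPool:
    """Round-robin pool that skips servers failing consecutive health checks."""

    def __init__(self, servers, unhealthy_threshold=3):
        self.servers = list(servers)
        # Assumption: a server is taken out after this many consecutive failures.
        self.unhealthy_threshold = unhealthy_threshold
        self.failures = {s: 0 for s in self.servers}
        self._next = 0

    def record_check(self, server, ok):
        # A passing check resets the counter; a failing one increments it.
        self.failures[server] = 0 if ok else self.failures[server] + 1

    def healthy(self):
        return [s for s in self.servers
                if self.failures[s] < self.unhealthy_threshold]

    def route(self):
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy servers")
        server = candidates[self._next % len(candidates)]
        self._next += 1
        return server


# Hypothetical server names; "tis-2" is disabled and starts failing checks.
pool = HealthCheckedPool(["tis-1", "tis-2"])
for _ in range(3):
    pool.record_check("tis-2", ok=False)
print([pool.route() for _ in range(4)])  # every request now goes to tis-1
```

In a model like this, the delay before traffic stops is roughly the threshold multiplied by the interval between checks, which is one plausible reason the failover takes a while to kick in.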

Insights

There’s more work to be done here, as alerting could be configured, but as AWS provides this feature with minimal effort, we’ve already got something better than Azure.

Build times

One way to improve the development experience is to have fast turnaround (feedback) from external systems. Typically, when developers work on a feature, they push code to a central repository regularly. This code is a possible release candidate and therefore needs to go through a pipeline of different quality checks. This pipeline can take some time to complete, and you don’t want the developer waiting around for feedback.

...

Comparing pipelines in both AWS and Azure, it's easy to see that there are improvements of up to circa 1 minute in some stages over Azure. If multiple developers push multiple times a day, the compound savings could potentially be enormous.
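To make the compounding concrete, the arithmetic can be sketched as below. Only the figure of roughly 1 minute comes from the comparison above, and it is an upper bound for some stages. The number of pushes and developers are hypothetical inputs, and the function name is invented for illustration.

```python
def daily_saving_minutes(saving_per_run_min, pushes_per_dev_per_day, developers):
    """Total pipeline minutes saved per day across the team."""
    return saving_per_run_min * pushes_per_dev_per_day * developers


# Hypothetical team size and push frequency, for illustration only.
per_day = daily_saving_minutes(
    saving_per_run_min=1,
    pushes_per_dev_per_day=5,
    developers=10,
)
print(f"{per_day} minutes saved per day")
```

Substituting the team's real push counts would turn this into an actual estimate.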

Cost

For other stakeholders (management and C level staff), costs can be a defining factor in choosing a cloud provider.

Currently in Azure, we have an inventory of virtual machines and registries using storage space. The average monthly cost to run TIS in Azure is unknown because we were denied access to that information. AWS, on the other hand, will give an estimate.

...

The issue with this estimate is that we currently have a lot of experiments and resources being used for the migration and other projects (TIS SS and Reval).

Insights:

It's not currently possible to do an apples-to-apples comparison of the cost of running TIS in Azure versus AWS. Also, at the moment, we have done little to optimise cost and resource usage.

It's probably better to come back to this point at a later date.