AWS: the moon on a stick

On Friday 3rd April, the development team came together to take the first step towards the migration to AWS. We met virtually (there was a little global pandemic happening), so the conversation was facilitated via MS Teams, with aww used to sketch up diagrams and notes.

It's been known for some time now that our current cloud provider (Microsoft Azure) is not well suited to what we need or want. The speed at which infrastructure is created and destroyed, missing must-have features, poor documentation and poor support are just some of the issues, to say the least.

We’ve been given approval to use AWS in place of Azure, as it’s more mature, has the features we need and is a leader in the space.

The plan for this session was to understand what we currently have in Azure, along with its dependencies. This gives us some transparency on what would be required in AWS to provide the same sort of features, as well as some appreciation of the size of the task at hand.

What we currently have

The following is what was drawn up during the session.

Apologies for the low resolution image! aww didn’t export the file in high resolution.

Below is a description of the current Azure system.

  • The applications are currently deployed to 2 virtual machines (blue/green) per environment (dev, staging, production)

    • These app servers host the Docker containers for TIS and run the Apache service for reverse proxying

    • Various other environments have additional VMs for branch-based testing (Pink)

  • Storing the data for these applications is a single database server (per environment) hosting MySQL. This also holds an instance of Maxwell’s daemon as a Docker container to provide CDC (change data capture) forwarding to RabbitMQ

  • MongoDB is also used to store information for the ESR integration system. This is currently a 3-node Docker setup on one VM, with the idea of moving to 3 separate VMs

  • There are a number of Elasticsearch instances of different versions

  • RabbitMQ runs on a cluster of 3 VMs per environment, deployed as containers, with another container holding a management web console

  • We have a build server that currently holds Jenkins, SonarQube and Metabase

  • An N3 bridge, hosted by IT, which allows us to connect to the wider NHS network (ESR)

  • A jumpbox (bastion server) to allow SSH connectivity

  • A VM hosting monitoring tools such as Grafana and Prometheus

  • An integration environment (single VM) used to spin up and test ESR end-to-end (E2E)
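
As an aside, the CDC path above (Maxwell’s daemon tailing the MySQL binlog and forwarding changes to RabbitMQ) emits JSON change events. A minimal sketch of what handling one of those events might look like (the event fields follow Maxwell’s documented JSON output; the table, values and handler are hypothetical):

```python
import json

# Example change event in Maxwell's JSON format (field names per Maxwell's docs);
# the database/table/payload values here are made up for illustration.
event = json.dumps({
    "database": "tcs",
    "table": "trainee",
    "type": "insert",
    "ts": 1585906800,
    "data": {"id": 42, "forename": "Ada"},
})

def handle_change(raw: str) -> str:
    """Hypothetical consumer: summarise a Maxwell CDC event by table and change type."""
    change = json.loads(raw)
    return f"{change['type']} on {change['database']}.{change['table']}"

print(handle_change(event))  # insert on tcs.trainee
```

A real consumer would read these events off a RabbitMQ queue rather than a local string, but the routing-by-table shape stays the same.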

Applications/Services

The following is a list of managed services used in Azure, as well as other HEE applications.

  • Azure VMs

  • Azure Blob Storage

  • Azure Container Registry

  • MS SQL for the NDW

  • Managed disks (VMs)

  • Data disk snapshots

  • TIS (Profile, Reference, TCS, Admins UI, Generic Upload, GMC connect, Keycloak, Notifications, Reval, Concerns, User management, Assessments, Service status)

  • ESR Integration (Inbound data reader, reconciliation, app record generator, data writer, inbound data writer, notification generator, audit, neo audit)

Downstream

Various downstream systems/products

  • A number of ETLs, mainly for the NDW

  • GMC (requires whitelisting)

  • NDW

 

The moon on a stick

With a somewhat complete list of the services and applications we currently use in TIS, we have at least some idea of the scale of the task at hand. We next wrote a list of things we want from the migration, ensuring we will provide the same or better level of service (of TIS) once we migrate, as well as improving other things such as tooling while we’re at it.

This is what we came up with:

Again sorry for the poor resolution. Here’s a description of what was noted:

  • We should start from the database out

  • We’re currently using a lot of VMs, which consumes a lot of DevOps time

  • Would be good to move the DB into a managed service

  • Current MongoDB instance has a number of managed service alternatives - Atlas (Mongo’s official service) or DocumentDB (AWS)

  • MySql could be moved to Aurora or RDS

  • AWS also has a managed graph database, Neptune, which could cover our Neo4j usage

  • Migration considerations - we need to investigate the options, but one known option is DMS (AWS Database Migration Service)
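
One reason the managed-database moves above look attractive: RDS/Aurora expose a standard MySQL endpoint, so the application side of the change should mostly be configuration rather than code. A rough sketch of what that might look like, with entirely hypothetical hostnames, reading the endpoint from the environment instead of hard-coding a VM address:

```python
import os

# Hypothetical endpoint values; a move to RDS/Aurora would largely mean
# changing this configuration rather than the application code itself.
DB_HOST = os.environ.get("DB_HOST", "tis-db.example.eu-west-2.rds.amazonaws.com")
DB_PORT = int(os.environ.get("DB_PORT", "3306"))

def jdbc_url(host: str = DB_HOST, port: int = DB_PORT, database: str = "tcs") -> str:
    """Build the MySQL connection URL the apps would use; only the host changes per provider."""
    return f"jdbc:mysql://{host}:{port}/{database}"

print(jdbc_url("localhost", 3306))  # jdbc:mysql://localhost:3306/tcs
```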

 

Services/Applications

  • Load balancing should be used where possible (unlike the current implementation where it only routes to a specific VM depending on the source address)

  • An Application Load Balancer is preferred where possible, as it provides URL-based routing

  • Route 53 for DNS and WAF should be used (Route 53 already in place)

  • For applications that are already Dockerised, we should prefer orchestration services (ECS, EKS) where possible, as they save time and money

  • Other options include Beanstalk / Fargate / serverless (Lambda) for basic Java applications, light/fast containerised apps or ETL-type jobs

  • AWS provides a managed Elasticsearch service

  • Cognito seems to be the current choice for authentication

    • Provides a good migration path for Keycloak (KC)

  • Managed services for build tools where possible; moving away from Jenkins would be good. We could look into CircleCI, AWS CodeDeploy and Drone (we’re currently trialling GitHub Actions)
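
To illustrate the URL-based routing point above: an Application Load Balancer evaluates ordered path-pattern rules and forwards each request to a target group, first match wins, with a default action as a fallback. A toy model of that behaviour (the paths and service names are hypothetical, not our actual routing rules):

```python
# Toy model of ALB path-based routing: ordered (pattern, target) rules,
# first match wins, default target otherwise. Paths/targets are hypothetical.
from fnmatch import fnmatch

RULES = [
    ("/api/reference/*", "reference-service"),
    ("/api/tcs/*", "tcs-service"),
    ("/admin/*", "admins-ui"),
]
DEFAULT_TARGET = "static-site"

def route(path: str) -> str:
    """Return the target group for a request path, ALB-style."""
    for pattern, target in RULES:
        if fnmatch(path, pattern):
            return target
    return DEFAULT_TARGET

print(route("/api/tcs/trainees/1"))  # tcs-service
print(route("/index.html"))          # static-site
```

This is exactly what the current per-VM Apache reverse proxies are doing by hand; an ALB would give us the same behaviour as managed configuration.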

 

Questions/unknowns/concerns

  • Serverless with java

    • Startup time for the JVM

    • Version support of the JVM

    • What other languages can we use / what other languages does the team know?

    • ECS supports Fargate as well as EC2, which could allow for piecemeal migrations

  • EFS support in Fargate?

  • Other forms of notifications available - could we potentially use SMS for Trainee self service

  • We’re currently using RabbitMQ; are there alternatives (SQS + SNS)? Is it still valid to stick with it? Should we look into Kafka/data streams?

  • Should we look at extending ES usage to other parts of the app?

  • Currently there are a couple of known limitations in GitHub Actions

  • The wider NHS/public sector has an auth service, but it’s currently not far enough along in its development. It’s perhaps worth keeping that in mind while working on Cognito, or perhaps even delaying until the product is ready
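
One thing to keep in mind if we evaluate SQS against RabbitMQ: SQS has an at-least-once delivery model, where a received message becomes invisible for a visibility timeout and reappears if the consumer never deletes it. A toy in-memory sketch of that semantic (not the real SQS API; just the behaviour we'd need consumers to tolerate):

```python
import time

class ToyQueue:
    """In-memory sketch of SQS-style visibility timeouts; not the real SQS API."""

    def __init__(self, visibility_timeout: float = 30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # id -> (body, invisible_until)
        self._next_id = 0

    def send(self, body: str) -> int:
        self._next_id += 1
        self._messages[self._next_id] = (body, 0.0)
        return self._next_id

    def receive(self):
        now = time.monotonic()
        for msg_id, (body, invisible_until) in self._messages.items():
            if invisible_until <= now:
                # Hide rather than remove: if the consumer crashes before
                # delete(), the message reappears after the timeout.
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        self._messages.pop(msg_id, None)

q = ToyQueue(visibility_timeout=0.05)
mid = q.send("trainee-updated")
assert q.receive() == (mid, "trainee-updated")
assert q.receive() is None                       # in flight, invisible
time.sleep(0.1)
assert q.receive() == (mid, "trainee-updated")   # redelivered: at-least-once
q.delete(mid)
assert q.receive() is None
```

RabbitMQ's unacked-message redelivery gives similar guarantees, so consumers written to be idempotent should survive a swap either way.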

What’s next?

So with this laid out, we now have a better idea of where we want to be. A number of investigation tasks need to happen (around the build tools, monitoring, serverless, messaging) to gain a little more knowledge and to assess whether they are right for us. We also need to design the standard networking approach, along with the security considerations we need to bake in, so that environments for each application don’t stray too far from one another and become difficult to manage.

Summary

All in all, this meeting has made it clear that we would like to make the most of what makes a cloud provider good: using managed services as much as possible, saving on costs where we can, gaining features such as automatic scaling, and saving the limited time of our DevOps engineers.