With the design largely defined at a high level, there were still many questions to answer and decisions to make. We decided to hold a whiteboard session to map out the areas we thought we needed to address before continuing with the new implementation.
Decisions
Queuing system
- We agreed to use RabbitMQ, as it seemed to be the most user-friendly option with possibly the largest market share
- Multiple queues will be used: a single inbound queue, multiple internal queues, and possibly multiple outbound queues
- The queues will use binding keys to route messages rather than headers
- Auto-ack will be disabled by default, meaning a consumer must manually acknowledge a message for it to be removed from the queue. This should only be done after the message has been processed and its output either pushed onto another queue or saved. This ensures that messages are not lost when consumers fail
- We will store the retry count on the message, with a maximum of 10 retries allowed
- The wait time (back-off) will be 1 minute before a message can be processed again
- We'll attempt to use an intelligent circuit breaker that looks at the type of exception and judges whether it makes sense to retry. Errors such as HTTP 400s won't make sense to retry, as they indicate an issue with the client (which will most likely require developer work)
- Dead letters will be split between different dead-letter queues by type
- We'll need to come back to rate limiting, as we do not want to overwhelm our own systems
- We'll need to base this on some actual metrics
- We'll have duplicate queues so that we can create an audit service to record messages/events
- The granularity of audits will need to be defined, as recording too granularly may create an influx of data that isn't useful
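The retry and circuit-breaker rules above can be sketched as a simple decision function. This is a sketch only: the class and method names are illustrative, and real exception classification would be richer than an HTTP status code.

```java
// Sketch of the agreed retry rules: max 10 retries, 1-minute back-off,
// and no retries for client-side (4xx) errors. Names are illustrative.
public class RetryPolicy {
    static final int MAX_RETRIES = 10;          // agreed maximum
    static final long BACKOFF_MILLIS = 60_000;  // 1 minute before reprocessing

    /** A 4xx status means a client-side problem: retrying won't help. */
    static boolean isRetryable(int httpStatus) {
        return httpStatus < 400 || httpStatus >= 500;
    }

    /** Retry only retryable errors that haven't exhausted the retry budget. */
    static boolean shouldRetry(int httpStatus, int retriesSoFar) {
        return isRetryable(httpStatus) && retriesSoFar < MAX_RETRIES;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(503, 2));  // transient server error: retry
        System.out.println(shouldRetry(400, 0));  // bad request: dead-letter immediately
        System.out.println(shouldRetry(503, 10)); // retry budget exhausted: dead-letter
    }
}
```

Messages that fail this check would go to the appropriate dead-letter queue rather than back onto the work queue.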
Messages - on hold
- As messages run through the system, it may be possible to reduce the number of messages within a certain period
- There may be data that we don't care about and can remove from the messages
Schema
To increase the quality of the data, we'll need a way to validate the inbound dataset
- Some form of schema will be required to validate the data. We already mentioned that JSON with JSON Schema would be a good fit
- Data that doesn't conform to the schema will be dropped into the dead-letter queue
- We should probably validate data coming out of the system too
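The validate-or-dead-letter step might look like the sketch below. A real implementation would use a proper JSON Schema validator library; the required field names and queue names here are hypothetical.

```java
import java.util.Map;

// Sketch: route a parsed inbound message either onward or to a dead-letter
// queue. Field names and queue names are placeholders, not agreed values.
public class InboundValidator {
    static final String[] REQUIRED_FIELDS = {"id", "type", "payload"};

    static boolean isValid(Map<String, Object> message) {
        for (String field : REQUIRED_FIELDS) {
            if (message.get(field) == null) {
                return false;  // missing or null required field
            }
        }
        return true;
    }

    /** Returns the routing target for the message. */
    static String route(Map<String, Object> message) {
        return isValid(message) ? "inbound.valid" : "deadletter.schema";
    }
}
```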
Auditing
This will be a vital part of the system, as we currently have little visibility into what's what in the current implementation. Finding out when, what, and why something has happened is very difficult
- We should log the actual data in the system
- Log metadata with the actual data (date created, date modified, service name, triggers, exceptions, etc.)
- It would be good to use this information to build a system that can trace a message's journey through the system and intelligently indicate whether something is yet to be exported or still running through the system
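A minimal sketch of what one audit record might carry, following the metadata bullet above. All field names are assumptions for illustration, not a settled schema.

```java
import java.time.Instant;

// Sketch of an audit record: the message data would be stored alongside
// metadata like this. Field names are illustrative only.
public class AuditEvent {
    final String messageId;
    final String serviceName; // which service handled the message
    final String trigger;     // what caused the event, e.g. "inbound", "retry"
    final String exception;   // null unless something failed
    final Instant recordedAt; // when the event was captured

    AuditEvent(String messageId, String serviceName, String trigger, String exception) {
        this.messageId = messageId;
        this.serviceName = serviceName;
        this.trigger = trigger;
        this.exception = exception;
        this.recordedAt = Instant.now();
    }

    // A message's journey is its audit events ordered by recordedAt.
    @Override
    public String toString() {
        return recordedAt + " [" + messageId + "] " + serviceName + " " + trigger
                + (exception == null ? "" : " exception=" + exception);
    }
}
```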
Reporting
Feedback from a workshop (Sept 25th)
Database
What sort of database would we need? Do we need one at all?
- Will need to be fast
- Might be good if it works well with the message body (JSON)
- Doesn't look like we need a relational DB so a document store may be enough (Mongo/Cosmos/Dynamo)
- Does this tie into how we audit?
- What features do we need?
- UX?
- What's the debug journey like?
- What key terms do we search by?
- What sort of issues do we have?
Managed Services - To revisit
It may be worth using the cloud provider's managed services from the get-go, but we'll need to consider migration issues and the business case for resiliency
Cloud native - K8s & Spring boot - To revisit
There is ongoing work to assess, and possibly migrate to, a Kubernetes infrastructure. As TIS is currently deployed to the cloud, some thought is needed to make it cloud native. There are certain features in the cloud-native space that exist in both Kubernetes and Spring Boot but are not compatible with each other (service registration, failover, retries, etc.)
Automated Testing
It was agreed that any high-quality system will require a suite of automated tests. The team currently has strong experience with testing at the unit level, but we have identified that we may need to upskill in functional, integration, and end-to-end testing.
Frameworks/techniques such as REST Assured, mocking and contract testing were discussed**; a dedicated full-time tester was also discussed
Monitoring
Depending on the final architecture, we may need to focus monitoring on certain parts. Components such as the queue will need to be heavily monitored
- Measurements on the queue
- size of the queues
- throughput
- exceptions thrown
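These measurements could feed a simple alerting rule. The sketch below uses placeholder thresholds; real limits should come from the actual metrics mentioned under rate limiting.

```java
// Sketch of an alert rule over the queue measurements listed above.
// The threshold value is a placeholder, not an agreed limit.
public class QueueAlerts {
    static final long MAX_DEPTH = 10_000;

    /** Flag a queue that is too deep, or has messages but no throughput. */
    static boolean needsAttention(long depth, long processedLastMinute) {
        return depth > MAX_DEPTH || (depth > 0 && processedLastMinute == 0);
    }
}
```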
Security
We don't want to leave security until the end; we agreed that we want to bake security into the system from the start of the project. We've identified areas we'll need to work on
- Services
- We'll use the current JWT implementation
- Queue
- We'll need to read up on the documentation on how to harden/lock down/productionize a RabbitMQ cluster
- Storage container
- Azure already encrypts data at rest using AES, and in transit
- Cloud Infra
- Use whatever cloud-level security measures are available, e.g. whitelisting, only opening required ports, etc.
Development & Deployment Strategy
This project will require both fixing existing bugs and rearchitecting for the new system. We will need to take extra care to be efficient, avoid duplicating work where possible, and ensure that any new work done for the "new world" does not affect the current TIS system or the current ESR integration.
We spoke about putting the new features behind feature flags so that the new code will not run in the normal path from dev through to deployment on Prod. This would also mean we'll need new environments; it was suggested that Dev2 and Stage2 could be created with the feature flags switched on.
If we continue to use Spring Boot and its out-of-the-box support for connecting to a message queue, we may need to extend or modify the auto-configuration classes to disable automatic connection to a queue. The default behaviour is to load the connectivity configuration whenever certain libraries are on the classpath, so deployments from dev through to prod would otherwise try to connect to a queue even when it won't be using one.
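One way this could be handled (an assumption about approach, not a settled decision) is Spring Boot's standard property for excluding specific auto-configurations, set only in the environments that shouldn't connect:

```properties
# e.g. application-dev.properties: keep the AMQP libraries on the classpath,
# but stop Spring Boot from auto-configuring a RabbitMQ connection.
spring.autoconfigure.exclude=org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration
```

This avoids modifying the auto-configuration classes themselves, at the cost of per-environment configuration.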
Costs
Having this new architecture will mean we'll need at least one machine to host the message queue system. We'll need to come back and find ways to save on costs, as having a resilient system is paramount but can be expensive
** https://cloud.spring.io/spring-cloud-contract/reference/htmlsingle/