With the design defined only at a high level, there were still a lot of questions to answer and decisions to make. We held a whiteboard session to map the areas that we thought we needed to address before continuing with the new implementation.
...
- We agreed on using RabbitMQ as it seemed to be the most user-friendly option, with possibly the largest market share
- Multiple queues will be used: a single inbound queue, multiple internal queues and possibly multiple outbound queues
- The queues will use binding keys to route the messages rather than headers
- By default, auto ack will be disabled, which means that a consumer will need to manually acknowledge a message in order for it to be removed from the queue. This should only be done after the message has been processed and the output has either been pushed onto another queue or saved. This ensures that messages will not be lost when consumers fail
- We will store the number of retries on the message, and a maximum of 10 retries will be allowed
- The wait time (back-off) will be 1 minute before a message can be processed again
- We'll attempt to use an intelligent circuit breaker which will look at the type of exception and judge whether it would make sense to retry. Errors such as HTTP 400s won't make sense to retry as they indicate an issue with the client (which will most likely require dev work)
- Dead letters will be split between different dead letter queues by type
- We'll need to come back to rate limiting as we do not want to overwhelm our own systems
- We'll need to base this on some actual metrics
- We'll have duplicate queues so that we can create an audit service to record message/events
- The granularity of audits will need to be defined, as auditing at too fine a level may create an influx of data which may not be useful
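The acknowledgement, retry and dead-letter decisions above can be sketched as a small policy class. This is only an illustration, not the agreed implementation: `DownstreamHttpException`, `Outcome` and `RetryPolicy` are hypothetical names, and treating every 4xx status as non-retryable is an assumption that goes slightly beyond the "HTTP 400s" example in the notes.

```java
import java.time.Duration;

/** Hypothetical exception carrying the HTTP status returned by a downstream call. */
class DownstreamHttpException extends RuntimeException {
    final int status;

    DownstreamHttpException(int status) {
        super("Downstream call failed with HTTP " + status);
        this.status = status;
    }
}

/** What the consumer should do with a message once processing has finished or failed. */
enum Outcome { ACK, RETRY, DEAD_LETTER }

final class RetryPolicy {
    static final int MAX_RETRIES = 10;                       // limit taken from the notes
    static final Duration BACK_OFF = Duration.ofMinutes(1);  // fixed wait before reprocessing

    private RetryPolicy() { }

    /**
     * @param retriesSoFar retry count stored on the message
     * @param failure      exception thrown while processing, or null on success
     */
    static Outcome decide(int retriesSoFar, Throwable failure) {
        if (failure == null) {
            // Acknowledge only once the output has been pushed on or saved.
            return Outcome.ACK;
        }
        if (failure instanceof DownstreamHttpException) {
            int status = ((DownstreamHttpException) failure).status;
            if (status >= 400 && status < 500) {
                // Assumption: client errors will not succeed on retry.
                return Outcome.DEAD_LETTER;
            }
        }
        return retriesSoFar < MAX_RETRIES ? Outcome.RETRY : Outcome.DEAD_LETTER;
    }
}
```

A consumer would call `RetryPolicy.decide(...)` after each processing attempt and only acknowledge the message on `ACK`. On `RETRY` it would increment the stored count and wait `BACK_OFF` before the message is processed again.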
Inbound message types:
- APC: Applicant confirmation message
- POS: Position information
- POR: Position reconciliation
- PER: Person record
- ADD: Address
- QAL: Qualifications
- ABS: Absence
- DCC: Notification confirmation message
Outbound message types:
- APP: Applicants
- DNC/DNF file: Notifications
- RTC: Confirmation messages
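Since messages will be routed by binding key, one option is to derive the routing key from the message type. The sketch below models the types listed above as an enum. The `esr.<direction>.<type>` key format is purely an assumption for illustration, as is splitting DNC and DNF into separate constants.

```java
import java.util.Locale;

/** Direction of a message relative to the new system. */
enum Direction { INBOUND, OUTBOUND }

/** Message types from the lists above; descriptions copied from the notes. */
enum MessageType {
    APC(Direction.INBOUND, "Applicant confirmation message"),
    POS(Direction.INBOUND, "Position information"),
    POR(Direction.INBOUND, "Position reconciliation"),
    PER(Direction.INBOUND, "Person record"),
    ADD(Direction.INBOUND, "Address"),
    QAL(Direction.INBOUND, "Qualifications"),
    ABS(Direction.INBOUND, "Absence"),
    DCC(Direction.INBOUND, "Notification confirmation message"),
    APP(Direction.OUTBOUND, "Applicants"),
    DNC(Direction.OUTBOUND, "Notifications"),
    DNF(Direction.OUTBOUND, "Notifications"),
    RTC(Direction.OUTBOUND, "Confirmation messages");

    final Direction direction;
    final String description;

    MessageType(Direction direction, String description) {
        this.direction = direction;
        this.description = description;
    }

    /** Hypothetical routing key format, e.g. "esr.inbound.apc". */
    String routingKey() {
        return "esr."
                + direction.name().toLowerCase(Locale.ROOT)
                + "."
                + name().toLowerCase(Locale.ROOT);
    }
}
```

With keys shaped like this, an internal queue could bind to a single type or, on a topic exchange, to a pattern covering every inbound message.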
Messages - on hold
- As we're running messages through the system, it may be possible to reduce the number of messages within a certain period
- There may be data that we don't care about and can remove from the messages
...
- Will need to be fast
- Might be good if it works well with the message body (JSON)
- Doesn't look like we need a relational DB so a document store may be enough (Mongo/Cosmos/Dynamo)
- Does this tie into how we audit?
- What features do we need?
- UX?
- What's the debug journey like?
- What key terms do we search by?
- What sort of issues do we have?
...
There is ongoing work to assess and possibly migrate to a Kubernetes infrastructure. As TIS is currently deployed onto the cloud, some thought needs to be given to making it cloud native. There are certain features in the cloud native space that exist in both Kubernetes and Spring Boot but are not compatible with each other (service registration, failover, retries etc.)
Automated Testing
It was agreed that any high-quality system will require a suite of automated tests. The team currently has strong experience with testing at the unit level but has identified that we may need to upskill in terms of functional / integration / end-to-end tests.
...
We don't want to leave security till the end. We agreed that we want to bake security into the system from the start of the project. There are areas we've identified that we'll need to work on:
- Services
- We'll use the current JWT implementation
- Queue
- We'll need to read up on the documentation on how to harden/lock down/productionize a RabbitMQ cluster
- Storage container
- Azure already encrypts data at rest (using AES) and in transit
- Cloud Infra
- Use whatever cloud-level security measures are available, such as whitelisting and opening only the required ports
Development & Deployment Strategy
This project will require both the fixing of existing bugs and the rearchitecting for this new system. We will need to take extra care to be efficient, avoid duplicating work where possible, and ensure that any new work being done for the "New world" will not affect the current TIS system or the current ESR integration.
We spoke about running the new features within feature flags so that the code will not run during normal dev through to deployment on Prod. This would also mean we'll need new environments. It was suggested that Dev2 and Stage2 could be created and have the feature flags switched on there.
If we continue to use Spring Boot and its out-of-the-box feature for connecting to a message queue, we may need to extend and modify the auto-configuration classes to disable automatic connection to a queue. This is because the default behaviour is to load the connectivity configuration if certain libraries are present on the classpath. Without that change, deployments from dev through to prod would need to connect to a queue even though they won't be using it.
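The feature flag idea can be sketched in plain Java as below. This is a minimal illustration under assumptions: `FeatureFlags` and the flag name `esr.new-world.enabled` are hypothetical, and the settings map stands in for whatever configuration source each environment uses. Separately, Spring Boot (2.x and 3.x) documents a `spring.autoconfigure.exclude` property for switching off individual auto-configuration classes, which may be worth evaluating before modifying those classes.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical flag holder; settings would come from per-environment config. */
final class FeatureFlags {
    private final Map<String, String> settings;

    FeatureFlags(Map<String, String> settings) {
        this.settings = settings;
    }

    /** Flags default to off, so existing environments keep current behaviour. */
    boolean isEnabled(String flag) {
        return Boolean.parseBoolean(settings.getOrDefault(flag, "false"));
    }

    /**
     * Builds a component only when the flag is on. The factory is never
     * invoked otherwise, so no queue connection is attempted.
     */
    <T> Optional<T> ifEnabled(String flag, Supplier<T> factory) {
        return isEnabled(flag) ? Optional.ofNullable(factory.get()) : Optional.empty();
    }
}
```

On Dev2 and Stage2 the flag would be set to `true`, so the supplied factory (for example, one that opens the queue connection) runs. Everywhere else the flag is absent and the new code path stays dormant.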
Costs
This new architecture will mean that we'll need at least one machine to host the message queue system. We'll need to come back to find ways to save on costs, as having a resilient system is paramount but can be expensive.
** https://cloud.spring.io/spring-cloud-contract/reference/htmlsingle/
...