With the design largely defined at a high level, there were still many questions to answer and decisions to make. We decided to hold a whiteboard session to map out the areas we thought we needed to address before continuing with the new implementation.
Decisions
Queuing system
- We agreed to use RabbitMQ, as it seemed to be the most user-friendly option with possibly the largest market share
- Multiple queues will be used: a single inbound queue, multiple internal queues, and possibly multiple outbound queues
- The queues will use binding keys to route messages rather than headers
- Auto-ack will be disabled by default, meaning a consumer must manually acknowledge a message for it to be removed from the queue. This should only be done after the message has been processed and its output either pushed onto another queue or saved. This ensures that messages are not lost when consumers fail
- We will store the retry count on the message, with a maximum of 10 retries allowed
- The wait time (back-off) will be 1 minute before a message can be processed again
- We'll attempt to use an intelligent circuit breaker that looks at the type of exception and judges whether it makes sense to retry. Errors such as HTTP 400s won't make sense to retry, as they indicate an issue with the client (which will most likely require developer work)
- Dead letters will be split between different dead-letter queues by type
- We'll need to come back to rate limiting, as we do not want to overwhelm our own systems
- We'll need to base this on some actual metrics
- We'll have duplicate queues so that we can create an audit service to record messages/events
- The granularity of audits will need to be defined, as recording too granularly may create an influx of data that isn't useful
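The retry and circuit-breaker rules above can be sketched as a simple decision function. This is a sketch only: the class and method names are illustrative, and real exception classification would be richer than an HTTP status code.

```java
// Sketch of the agreed retry rules: max 10 retries, 1-minute back-off,
// and no retries for client-side (4xx) errors. Names are illustrative.
public class RetryPolicy {
    static final int MAX_RETRIES = 10;          // agreed maximum
    static final long BACKOFF_MILLIS = 60_000;  // 1 minute before reprocessing

    /** A 4xx status means a client-side problem: retrying won't help. */
    static boolean isRetryable(int httpStatus) {
        return httpStatus < 400 || httpStatus >= 500;
    }

    /** Retry only retryable errors that haven't exhausted the retry budget. */
    static boolean shouldRetry(int httpStatus, int retriesSoFar) {
        return isRetryable(httpStatus) && retriesSoFar < MAX_RETRIES;
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(503, 2));  // transient server error: retry
        System.out.println(shouldRetry(400, 0));  // bad request: dead-letter immediately
        System.out.println(shouldRetry(503, 10)); // retry budget exhausted: dead-letter
    }
}
```

Messages that fail this check would go to the appropriate dead-letter queue rather than back onto the work queue.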
Messages - on hold
- As messages run through the system, it may be possible to reduce the number of messages within a certain period
- There may be data that we don't care about and can remove from the messages
Schema
To increase the quality of the data, we'll need a way to validate the inbound dataset
- Some form of schema will be required to validate the data. We already mentioned that JSON with JSON Schema would be a good fit
- Data that doesn't conform to the schema will be dropped into the dead-letter queue
- We should probably validate data coming out of the system too
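The validate-or-dead-letter step might look like the sketch below. A real implementation would use a proper JSON Schema validator library; the required field names and queue names here are hypothetical.

```java
import java.util.Map;

// Sketch: route a parsed inbound message either onward or to a dead-letter
// queue. Field names and queue names are placeholders, not agreed values.
public class InboundValidator {
    static final String[] REQUIRED_FIELDS = {"id", "type", "payload"};

    static boolean isValid(Map<String, Object> message) {
        for (String field : REQUIRED_FIELDS) {
            if (message.get(field) == null) {
                return false;  // missing or null required field
            }
        }
        return true;
    }

    /** Returns the routing target for the message. */
    static String route(Map<String, Object> message) {
        return isValid(message) ? "inbound.valid" : "deadletter.schema";
    }
}
```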
Auditing
This will be a vital part of the system, as we currently have little visibility into what's what in the current implementation. Finding out when, what, and why something has happened is very difficult
- We should log the actual data in the system
- Log metadata with the actual data (date created, date modified, service name, triggers, exceptions, etc.)
- It would be good to use this information to build a system that can trace a message's journey through the system and intelligently indicate whether something is yet to be exported or still running through the system
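A minimal sketch of what one audit record might carry, following the metadata bullet above. All field names are assumptions for illustration, not a settled schema.

```java
import java.time.Instant;

// Sketch of an audit record: the message data would be stored alongside
// metadata like this. Field names are illustrative only.
public class AuditEvent {
    final String messageId;
    final String serviceName; // which service handled the message
    final String trigger;     // what caused the event, e.g. "inbound", "retry"
    final String exception;   // null unless something failed
    final Instant recordedAt; // when the event was captured

    AuditEvent(String messageId, String serviceName, String trigger, String exception) {
        this.messageId = messageId;
        this.serviceName = serviceName;
        this.trigger = trigger;
        this.exception = exception;
        this.recordedAt = Instant.now();
    }

    // A message's journey is its audit events ordered by recordedAt.
    @Override
    public String toString() {
        return recordedAt + " [" + messageId + "] " + serviceName + " " + trigger
                + (exception == null ? "" : " exception=" + exception);
    }
}
```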
Reporting
Feedback from a workshop (Sept 25th)
Database
What sort of database would we need? Do we need one at all?
- Will need to be fast
- Might be good if it works well with the message body (JSON)
- Doesn't look like we need a relational DB so a document store may be enough (Mongo/Cosmos/Dynamo)
- Does this tie into how we audit?
- What features do we need?
- UX?
- What's the debug journey like?
- What key terms do we search by?
- What sort of issues do we have?
Managed Services - To revisit
It may be worth using the cloud provider's managed services from the get-go, but we'll need to consider migration issues and the business case for resiliency
Cloud native - K8s & Spring boot - To revisit
There is ongoing work to assess, and possibly migrate to, a Kubernetes infrastructure. As TIS is currently deployed to the cloud, some thought is needed to make it cloud native. There are certain features in the cloud-native space that exist in both Kubernetes and Spring Boot but are not compatible with each other (service registration, failover, retries, etc.)
Automated Testing
It was agreed that any high-quality system will require a suite of automated tests. The team currently has strong experience with testing at the unit level, but we have identified that we may need to upskill in functional, integration, and end-to-end testing.
Frameworks/techniques such as REST Assured, mocking and contract testing were discussed**; a dedicated full-time tester was also discussed
Monitoring
Depending on the final architecture, we may need to focus monitoring on certain parts. Components such as the queue will need to be heavily monitored
- Measurements on the queue
- size of the queues
- throughput
- exceptions thrown
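These measurements could feed a simple alerting rule. The sketch below uses placeholder thresholds; real limits should come from the actual metrics mentioned under rate limiting.

```java
// Sketch of an alert rule over the queue measurements listed above.
// The threshold value is a placeholder, not an agreed limit.
public class QueueAlerts {
    static final long MAX_DEPTH = 10_000;

    /** Flag a queue that is too deep, or has messages but no throughput. */
    static boolean needsAttention(long depth, long processedLastMinute) {
        return depth > MAX_DEPTH || (depth > 0 && processedLastMinute == 0);
    }
}
```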
Security
We don't want to leave security until the end; we agreed that we want to bake security into the system from the start of the project. We've identified areas we'll need to work on
- Services
- We'll use the current JWT implementation
- Queue
- We'll need to read up on the documentation on how to harden/lock down/productionize a RabbitMQ cluster
- Storage container
- Azure already encrypts data at rest using AES, and in transit
- Cloud Infra
- Use whatever cloud-level security measures are available, e.g. whitelisting, only opening required ports, etc.
Development & Deployment Strategy
This project will require both fixing existing bugs and rearchitecting for the new system. We will need to take extra care to be efficient, avoid duplicating work where possible, and ensure that any new work done for the "new world" does not affect the current TIS system or the current ESR integration.
We spoke about putting the new features behind feature flags so that the new code will not run in the normal path from dev through to deployment on Prod. This would also mean we'll need new environments; it was suggested that Dev2 and Stage2 could be created with the feature flags switched on.
If we continue to use Spring Boot and its out-of-the-box support for connecting to a message queue, we may need to extend or modify the auto-configuration classes to disable automatic connection to a queue. The default behaviour is to load the connectivity configuration whenever certain libraries are on the classpath, so deployments from dev through to prod would otherwise try to connect to a queue even when it won't be using one.
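One way this could be handled (an assumption about approach, not a settled decision) is Spring Boot's standard property for excluding specific auto-configurations, set only in the environments that shouldn't connect:

```properties
# e.g. application-dev.properties: keep the AMQP libraries on the classpath,
# but stop Spring Boot from auto-configuring a RabbitMQ connection.
spring.autoconfigure.exclude=org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration
```

This avoids modifying the auto-configuration classes themselves, at the cost of per-environment configuration.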
Costs
Having this new architecture will mean we'll need at least one machine to host the message queue system. We'll need to come back and find ways to save on costs, as having a resilient system is paramount but can be expensive
** https://cloud.spring.io/spring-cloud-contract/reference/htmlsingle/