With a design largely defined, there were still a lot of questions to be answered and decisions to be made, so we held a whiteboard session to map out the areas we thought we needed to address before continuing with the new implementation.
Decisions
Queuing system
- We agreed on using RabbitMQ as it seemed to be the most user-friendly and possibly has the largest market share
- Multiple queues will be used, based around a single inbound queue, multiple internal ones and possibly multiple outbound queues
- The queues will use binding keys to route the messages rather than headers
- As a default, auto ack will be disabled, which means that a consumer will need to manually acknowledge a message in order for it to be removed from the queue. This should only be done after the message has been processed and its output has either been pushed onto another queue or saved. This ensures that messages will not be lost when consumers fail (see the sketch after this list)
- We will store the number of retries on the message, and a maximum of 10 retries will be allowed
- The wait time (back-off) will be 1 minute before a message can be processed again
- We'll attempt to use an intelligent circuit breaker which will look at the type of exception and judge whether it makes sense to retry. Errors such as HTTP 400s won't make sense to retry, as they indicate an issue with the client (which will most likely require dev work)
- Dead letters will be split between different dead-letter queues by type
- We'll need to come back to rate limiting as we do not want to overwhelm our own systems
  - We'll need to base this on some actual metrics
- We'll have duplicate queues so that we can create an audit service to record messages/events
  - The granularity of audits will need to be defined, as recording too much detail may create an influx of data which may not be useful
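To make the queuing decisions above a little more concrete, here is a rough sketch using the plain RabbitMQ Java client. It shows binding-key routing on a topic exchange, manual acknowledgements, a retry counter stored on the message, and routing to a dead-letter queue once the maximum retries are exceeded. The exchange, queue and header names are placeholders, and the 1-minute back-off is only noted in a comment (in practice it would likely be implemented with a TTL'd retry queue or the delayed-message plugin).

```java
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.BuiltinExchangeType;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class InboundConsumerSketch {

    private static final int MAX_RETRIES = 10; // as agreed above

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Binding keys on a topic exchange route the messages, rather than headers
        channel.exchangeDeclare("tis.inbound", BuiltinExchangeType.TOPIC, true);
        channel.queueDeclare("records.inbound", true, false, false, null);
        channel.queueBind("records.inbound", "tis.inbound", "record.*");

        // One of the (per-type) dead-letter queues
        channel.queueDeclare("records.deadletter", true, false, false, null);

        // autoAck=false: the message stays on the queue until we explicitly ack it
        channel.basicConsume("records.inbound", false, (consumerTag, delivery) -> {
            long tag = delivery.getEnvelope().getDeliveryTag();
            try {
                process(new String(delivery.getBody(), StandardCharsets.UTF_8));
                channel.basicAck(tag, false); // ack only once processing has succeeded
            } catch (Exception e) {
                int retries = currentRetries(delivery.getProperties());
                if (retries >= MAX_RETRIES) {
                    // Give up: push to the dead-letter queue and remove the original
                    channel.basicPublish("", "records.deadletter", delivery.getProperties(), delivery.getBody());
                } else {
                    // Republish with an incremented retry counter; the 1-minute back-off
                    // would be applied here, e.g. via a TTL'd retry queue
                    Map<String, Object> headers = new HashMap<>();
                    headers.put("x-retries", retries + 1);
                    AMQP.BasicProperties props = new AMQP.BasicProperties.Builder().headers(headers).build();
                    channel.basicPublish("tis.inbound", delivery.getEnvelope().getRoutingKey(), props, delivery.getBody());
                }
                channel.basicAck(tag, false);
            }
        }, consumerTag -> { });
    }

    private static int currentRetries(AMQP.BasicProperties properties) {
        Object value = properties.getHeaders() == null ? null : properties.getHeaders().get("x-retries");
        return value == null ? 0 : ((Number) value).intValue();
    }

    private static void process(String body) {
        // Placeholder for the real transformation / persistence step
    }
}
```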
Messages - on hold
- As we're running messages through the system, it may be possible to reduce the number of messages within a certain period
- There may be data that we don't care about and can remove from the messages
Schema
To increase the quality of the data, we'll need a way to validate the inbound dataset
- Some form of schema will be required to validate the data. We already mentioned that JSON with JSON Schema would be nice (a validation sketch follows this list)
- Data that doesn't conform to the schema will be dropped into the dead-letter queue
- We should probably validate data coming out of the system too
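As a rough illustration of that validation step, the sketch below checks an inbound message body against a JSON Schema, assuming the networknt json-schema-validator library as one possible choice; the schema and field names are entirely hypothetical. Non-conforming messages would then be routed to the dead-letter queue as described above.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.networknt.schema.JsonSchema;
import com.networknt.schema.JsonSchemaFactory;
import com.networknt.schema.SpecVersion;
import com.networknt.schema.ValidationMessage;

import java.util.Set;

public class InboundSchemaValidationSketch {

    // Hypothetical schema: the real one would describe the agreed inbound dataset
    private static final String SCHEMA = """
            {
              "$schema": "http://json-schema.org/draft-07/schema#",
              "type": "object",
              "required": ["recordId", "createdAt"],
              "properties": {
                "recordId": { "type": "string" },
                "createdAt": { "type": "string", "format": "date-time" }
              }
            }""";

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonSchemaFactory factory = JsonSchemaFactory.getInstance(SpecVersion.VersionFlag.V7);
        JsonSchema schema = factory.getSchema(SCHEMA);

        // Example message missing the required "createdAt" field
        JsonNode message = mapper.readTree("{\"recordId\": \"abc-123\"}");
        Set<ValidationMessage> errors = schema.validate(message);

        if (!errors.isEmpty()) {
            // In the real flow this message would be routed to the dead-letter queue
            errors.forEach(error -> System.out.println("Validation failed: " + error.getMessage()));
        }
    }
}
```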
Auditing
This will be a vital part of the system, as we currently have little visibility into what's what in the current implementation. Finding details of when, what and why something has happened is very difficult
- We should log the actual data in the system
- Log metadata with the actual data (date created, date modified, service name, triggers, exceptions etc.); see the sketch after this list
- It would be good to use this information to build something that can trace a message's journey through the system and then intelligently indicate whether it is yet to be exported or still running through the system
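A minimal sketch of what such an audit entry could hold is below; the field names are placeholders and the actual granularity is still to be decided.

```java
import java.time.Instant;

/**
 * Sketch of an audit entry recorded alongside each message.
 * Field names are placeholders; the real granularity is still to be defined.
 */
public record AuditEntry(
        String messageId,    // correlates all entries for one message's journey
        String serviceName,  // which component handled the message
        String trigger,      // what caused this step (e.g. inbound event, retry)
        Instant createdAt,
        Instant modifiedAt,
        String exception,    // populated when a step failed
        String payload       // the actual data, if we decide to store it
) { }
```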
Reporting
Feedback from a workshop (Sept 25th)
Database
What sort of database would we need? Do we need one at all?
- Will need to be fast
- Might be good for it to work well with the message body (JSON)
- Doesn't look like we need a relational DB so a document store may be enough (Mongo/Cosmos/Dynamo)
- Does this tie into how we audit?
- What features do we need?
- UX?
- What's the debug journey like?
- What key terms do we search by?
- What sort of issues do we have?
Managed Services - To revisit
It may be worth using the cloud provider's managed services from the get-go, but we'll need to consider migration issues and the business case for resiliency
Cloud native - K8s & Spring boot - To revisit
There is ongoing work to assess and possibly migrate to a Kubernetes infrastructure. As TIS is currently deployed onto the cloud, some thought needs to go into making it cloud native. Certain features in the cloud-native space are offered by both Kubernetes and Spring Boot but are not compatible with each other (service registration, failover, retries etc.)
Automated Testing
It was agreed that any high-quality system will require a suite of automated tests. The team currently has strong experience with testing at the unit level but has identified that we may need to upskill in terms of integration / end-to-end tests.
Frameworks/techniques such as REST Assured, mocking and contract testing were discussed**, as was the option of a dedicated full-time tester
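For a flavour of the integration-level tests mentioned above, here is a minimal REST Assured sketch; the base URI, endpoint and response fields are hypothetical.

```java
import org.junit.jupiter.api.Test;

import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.equalTo;

class RecordApiIT {

    @Test
    void returnsRecordById() {
        given()
            .baseUri("http://localhost:8080") // placeholder service under test
        .when()
            .get("/records/abc-123")          // hypothetical endpoint
        .then()
            .statusCode(200)
            .body("recordId", equalTo("abc-123"));
    }
}
```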
Monitoring
Depending on the final architecture, we may need to put more focus on monitoring certain parts. Components such as the queues will need to be heavily monitored
- Measurements on the queue (see the sketch after this list)
  - size of the queues
  - throughput
  - exceptions thrown
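As one possible shape for those measurements, the sketch below registers a Micrometer gauge for queue depth via Spring AMQP's AmqpAdmin; the queue and metric names are placeholders, and throughput and exception counts would be added as counters in the same way.

```java
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.amqp.core.AmqpAdmin;

import java.util.Properties;

public class QueueMetricsSketch {

    public QueueMetricsSketch(MeterRegistry registry, AmqpAdmin amqpAdmin) {
        // Exposes the current depth of the inbound queue ("records.inbound" is a placeholder)
        Gauge.builder("queue.records.inbound.depth", amqpAdmin, QueueMetricsSketch::inboundDepth)
                .description("Number of messages waiting on the inbound queue")
                .register(registry);
    }

    private static double inboundDepth(AmqpAdmin amqpAdmin) {
        Properties props = amqpAdmin.getQueueProperties("records.inbound");
        Object count = props == null ? null : props.get("QUEUE_MESSAGE_COUNT");
        return count instanceof Number ? ((Number) count).doubleValue() : 0.0;
    }
}
```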
** https://cloud.spring.io/spring-cloud-contract/reference/htmlsingle/