We use a message-based system, and a common problem with these systems is what to do with errored messages. Our errored messages are sent to the error exchange and eventually end up on the errored queue (in JMS speak this is the DLQ, or Dead Letter Queue). This queue may hold multiple types of messages, e.g. schema validation failures, too many retries, invalid messages, POR/POS details that could not be found, etc. Some may be recoverable and some may not. For example, a schema validation failure is most likely errored forever, but too many retries may point to an issue with TCS or a dependent service, and we may want to rerun the message. Our initial ideas for what to do with messages and message types are listed below.
...
When handling errors we need to consider what can be fixed, what should be reported, and timeliness.
Volume we may need to deal with: currently approx. 9k messages in about 5 weeks, i.e. roughly 2k messages per week.
We probably want to start by classifying by exception type. Below are our initial classifications, which we expect to evolve as we use the system, along with what we believe we should do with the messages:
-Schema validation: we want to clear these from the error queue and report them using audit.
...
-Invalid message within system: we want to keep these and investigate the system.
-Other generic failures: we want to keep these and investigate.
Action: Need a ticket to look at how we do the above, so that the errors are cleared down to only the ones we need to address. Do we use audit entirely for some of this?
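The classifications above could be captured as a simple lookup from error category to intended action. This is only a sketch: the category keys and the `Action` names are hypothetical, it covers only the classifications spelled out in these notes, and treating any unrecognised type as "keep and investigate" is an assumption based on the "other generic failures" rule.

```python
from enum import Enum


class Action(Enum):
    CLEAR_AND_AUDIT = "clear from error queue and report using audit"
    KEEP_AND_INVESTIGATE = "keep on error queue and investigate"


# Hypothetical category keys; the real exception types may be named differently.
ACTIONS = {
    "schema_validation": Action.CLEAR_AND_AUDIT,
    "invalid_message": Action.KEEP_AND_INVESTIGATE,
}


def action_for(category: str) -> Action:
    """Return the intended action for an error category.

    Anything not listed is treated as a generic failure (assumption).
    """
    return ACTIONS.get(category, Action.KEEP_AND_INVESTIGATE)
```

Keeping the mapping in one table should make it easy to change as the classifications evolve.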
Our next question is: what will make us look at these issues? Who looks at them? (Support question)
High rate? High overall number? Time frame?
We believe likely all of the above.
Timeframe based: do we want a regular task to review the queue? It would look at known and unknown issues, resolve and clear the queue, and look at how we can improve categorisation and resolution in the system. This is an important KPI and indicator of system health; do we want it as an integral part of the sprint? Do we use tech improvement work to look at this? Some of the info is time sensitive, e.g. we have the 12 week CoP etc., so a timeframe-based check is a good way to ensure we don't hit that.
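As a rough illustration of the time-sensitivity point, a regular review could flag errored messages that are getting close to the 12 week limit. This is a sketch under assumptions: we assume the clock starts at a timestamp recorded for the message (`errored_at` is a hypothetical field), and the two week warning margin is a placeholder, not an agreed figure.

```python
from datetime import datetime, timedelta

COP_LIMIT = timedelta(weeks=12)       # the 12 week CoP mentioned in the notes
WARNING_MARGIN = timedelta(weeks=2)   # placeholder margin (assumption)


def at_risk(errored_at: datetime, now: datetime) -> bool:
    """True if the message is within WARNING_MARGIN of the limit, or past it."""
    age = now - errored_at
    return age >= COP_LIMIT - WARNING_MARGIN
```

A per-sprint review task could run a check like this over the errored queue and deal with the flagged messages first.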
Do we need a circuit breaker? Probably not; timestamps are used in processing, so we should be OK.
Decision: High rate and timeframe-based checks will be our starting point.
-Action: Timeframe-based checks. We are not sure what the timeframe should be; to start, let's have a dedicated task per sprint until we find a better way of doing this, and learn as we go.
...
-How many rows per day? Do we want a proportion of messages rather than an absolute figure, e.g. in an hour we are looking at < x% error rate? Try this as a benchmark and refine? Look at how we produce the numbers before deciding the exact figure.
Action: Ticket for producing this error rate and alerting? Look at what Prometheus etc. is producing and whether we can tap into that.
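The proportion-based check could look something like the sketch below. The value of x has not been decided, so `threshold_percent` is a placeholder supplied by the caller, and where the counts come from (Prometheus or elsewhere) is deliberately left open.

```python
def error_rate_percent(errored: int, total: int) -> float:
    """Errored messages as a percentage of all messages in the window."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing to report
    return 100.0 * errored / total


def should_alert(errored: int, total: int, threshold_percent: float) -> bool:
    """Alert when the error rate is not below the threshold (healthy is < x%)."""
    return error_rate_percent(errored, total) >= threshold_percent
```

For example, 5 errored messages out of 200 in an hour is a 2.5% error rate, which would alert against a 2% threshold but not against a 5% one.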
...