Error handling process (Draft)

Errored messages are sent to the error exchange and end up on the errored queue. This queue may hold multiple types of messages, e.g. schema validation failures, too many retries, invalid messages, POR/POS details that could not be found, etc. Some may be recoverable and some may not. For example, a schema validation failure will always fail, whereas too many retries may indicate an issue with TCS or a dependent service. What to do with each message type is listed below.

When handling errors we need to consider what can be fixed, what should be reported, and timeliness.

Approximate volume: 9k messages in about 5 weeks, i.e. roughly 2k messages per week.

Classify by exception type:

-Schema validation, we want to clear from the error queue and report using audit.

-Maximum retries, we want to keep in the error queue and retry when appropriate.

-Invalid message within system, we want to keep in the error queue and investigate the system.

-Other generic failures, we want to keep in the error queue and investigate the system.
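
The classification above could be sketched as a simple lookup from exception type to handling action. This is only an illustration: the exception-type labels (`SCHEMA_VALIDATION`, `MAX_RETRIES`, `INVALID_MESSAGE`) and the `classify` function are hypothetical names, and the real values depend on what the services actually record on errored messages.

```python
from enum import Enum


class Action(Enum):
    CLEAR_AND_AUDIT = "clear from error queue and report using audit"
    KEEP_AND_RETRY = "keep in error queue and retry when appropriate"
    KEEP_AND_INVESTIGATE = "keep in error queue and look at system"


# Hypothetical exception-type labels (assumption, not taken from the system).
ACTION_BY_EXCEPTION_TYPE = {
    "SCHEMA_VALIDATION": Action.CLEAR_AND_AUDIT,
    "MAX_RETRIES": Action.KEEP_AND_RETRY,
    "INVALID_MESSAGE": Action.KEEP_AND_INVESTIGATE,
}


def classify(exception_type: str) -> Action:
    """Return the handling action for an exception type.

    Anything not explicitly listed is treated as an "other generic failure",
    which is kept on the queue for investigation.
    """
    return ACTION_BY_EXCEPTION_TYPE.get(exception_type, Action.KEEP_AND_INVESTIGATE)
```

Defaulting unknown types to "keep and investigate" means a new, unclassified failure is never silently cleared from the queue.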

We need a ticket to look at how we do the above, so that the error queue is cleared down to only the errors we need to address. Do we use audit entirely for some of this?

What makes us look at these issues? Who looks at them? (Support question)

High rate? High overall number? Time frame?

All of the above

Time frame: do we want a regular task to review? It would look at known and unknown issues, resolve and clear the queue, and look at how we can improve categorisation and resolution in the system. This is an important KPI and indicator of system health, so do we want it as an integral part of the sprint? Do we use tech improvement work to look at this? Some of the info is time sensitive (we have 12 week CoP etc…), so a timeframe is good to ensure we don't hit that.

Do we need a circuit breaker? Probably not: timestamps are used in processing, so it should be OK.

High rate and timeframe based checks will be our starting point.

-Timeframe based checks: not sure what the timeframe should be. To start, let's have a dedicated task per sprint until we find a better way of doing this, and learn as we go.

-High rate?

-How many rows per day? Do we want a proportion of messages rather than an absolute figure, e.g. in an hour we are looking at < x% error rate? Try it as a benchmark and refine? Look at how we produce the numbers before deciding on an exact figure.
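
A proportion-based check like the one described above might look like the following minimal sketch. The function names and the `threshold` parameter are hypothetical, and the figures in the usage note are purely illustrative; the value of x is still to be decided.

```python
def error_rate(errored: int, total: int) -> float:
    """Proportion of messages that errored in a window (0.0 when no traffic)."""
    if total <= 0:
        return 0.0
    return errored / total


def should_alert(errored: int, total: int, threshold: float) -> bool:
    """True when the window's error rate is at or above the threshold.

    `threshold` is a fraction (e.g. 0.05 for 5%), not an agreed figure.
    """
    return error_rate(errored, total) >= threshold
```

For example, with an illustrative threshold of 5%, `should_alert(errored=30, total=1000, threshold=0.05)` would not alert, while `should_alert(errored=60, total=1000, threshold=0.05)` would. Using a proportion rather than an absolute count keeps the check meaningful if overall message volume changes.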

Ticket for producing this error rate and alerting? Look at what Prometheus etc… is producing and whether we can tap into that.

What skills does someone need to look at this: domain knowledge, business knowledge, technical knowledge? (Needs revisiting and discussion with the wider team.)