...
We probably want to start by classifying by exception type. Below are our initial classifications, together with what we believe we should do with the messages in each case. We expect these to evolve as we use the system:
Schema validation: we want to clear these from the error queue and report them using audit.
...
Maximum retries: we want to keep these in the error queue and retry when appropriate.
...
Invalid message within the system: we want to keep these and investigate the system.
...
Other generic failures: we want to keep these and investigate.
Action: Need a ticket to look at how we do the above, so that the error queue is cleared down to only the errors we will need to address. Do we use audit entirely for some of this?
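The classification above could be sketched as a simple lookup from exception type to category to action. This is only an illustration: the exception type names, the enum names and the `action_for` helper are hypothetical and not taken from our system, and the mapping would need to change as the classifications evolve.

```python
from enum import Enum


class Category(Enum):
    SCHEMA_VALIDATION = "schema_validation"
    MAX_RETRIES = "max_retries"
    INVALID_MESSAGE = "invalid_message"
    OTHER = "other"


class Action(Enum):
    CLEAR_AND_AUDIT = "clear from error queue and report using audit"
    KEEP_AND_RETRY = "keep in error queue and retry when appropriate"
    KEEP_AND_INVESTIGATE_SYSTEM = "keep and investigate the system"
    KEEP_AND_INVESTIGATE = "keep and investigate"


# Hypothetical exception type names; replace with the real ones in use.
EXCEPTION_CATEGORIES = {
    "SchemaValidationException": Category.SCHEMA_VALIDATION,
    "MaxRetriesExceededException": Category.MAX_RETRIES,
    "InvalidMessageException": Category.INVALID_MESSAGE,
}

# One action per category, as listed in the notes above.
CATEGORY_ACTIONS = {
    Category.SCHEMA_VALIDATION: Action.CLEAR_AND_AUDIT,
    Category.MAX_RETRIES: Action.KEEP_AND_RETRY,
    Category.INVALID_MESSAGE: Action.KEEP_AND_INVESTIGATE_SYSTEM,
    Category.OTHER: Action.KEEP_AND_INVESTIGATE,
}


def classify(exception_type: str) -> Category:
    """Map an exception type name to a category; unknown types fall into OTHER."""
    return EXCEPTION_CATEGORIES.get(exception_type, Category.OTHER)


def action_for(exception_type: str) -> Action:
    """Return the action we currently believe applies to this exception type."""
    return CATEGORY_ACTIONS[classify(exception_type)]
```

Anything not explicitly mapped falls through to the generic "keep and investigate" bucket, which matches the intent that only known, safe-to-clear errors leave the queue.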
...
Action: Timeframe-based checks. We are not sure what the timeframe should be; to start, let's have a dedicated task per sprint until we find a better way of doing this, and learn as we go.
High rate?
...
How many rows per day? Do we want a proportion of messages rather than an absolute figure, e.g. in an hour we are looking at < x% error rate? Try that as a benchmark and refine? Look at how we produce the numbers before deciding on an exact figure.
Action: Ticket for producing this error rate and alerting on it? Look at what Prometheus etc. is producing and whether we can tap into that.
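A proportional check of the kind discussed above might look like the sketch below. It assumes we can obtain an error count and a total message count for a given window (for example, one hour); where those counts come from is still open. The function names are hypothetical, and the threshold is deliberately left as a parameter because the value of x has not been decided.

```python
def error_rate(error_count: int, total_count: int) -> float:
    """Proportion of messages in the window that ended up as errors."""
    if total_count == 0:
        # No traffic in the window: treat as no errors rather than dividing by zero.
        return 0.0
    return error_count / total_count


def should_alert(error_count: int, total_count: int, threshold: float) -> bool:
    """True when the error rate for the window reaches the threshold.

    `threshold` is a fraction (0.02 means 2%); the actual figure is undecided.
    """
    return error_rate(error_count, total_count) >= threshold
```

For example, `should_alert(5, 100, 0.02)` compares a 5% error rate against an illustrative 2% threshold. Using a proportion means a busy hour and a quiet hour are judged on the same scale, which an absolute rows-per-day figure would not give us.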
...
What skills does someone need to look at this: domain knowledge, business knowledge, technical knowledge? (Needs revisiting and discussion with the wider team.)