...
Our error notification system (Sentry) reported an error at 11pm the previous night, which led us to investigate the logs this morning.
Further investigation showed that this error had occurred multiple times since 4pm the previous day.
The awslogs-prod-tis-revalidation-recommendation logs had grown to 1.2GB, and we were getting lots of
Execution of Rabbit message listener failed . . . Caused by: java.lang.NullPointerException
errors. Investigation in the RabbitMQ console revealed a single endlessly requeuing message.
...
9:03 - Cai Willis reported the errors
9:35 - Investigation started
9:40 - Issue reported to users and Recommendation paused for 5 minutes
9:50 - Temporary fix made
9:50 - Comms sent to users that Recommendation was back
10:40 - Preventative measure deployed to recommendation service (prevent requeuing)
12:05 - Likely root cause discovered
13:15 - Root cause solution deployed to production environment
...
Root Cause(s)
Some unexpected data got onto reval.queue.recommendationstatuscheck.updated.recommendation
Poor handling of null values in deferral reasons
Default behaviour of requeuing messages when an exception is thrown
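
The last two root causes can be illustrated with a small plain-Java simulation. This is only a sketch, not the service's actual code: the Message class, the handler and the drain loop are all hypothetical, and the real message payload is not shown in this report. It shows how a null deferral reason plus requeue-on-failure makes a single message loop forever, and how either a null-safe lookup or rejecting without requeue breaks the loop.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

class RequeueSketch {

    /** Hypothetical message shape; only the nullable field matters here. */
    static final class Message {
        final String deferralReason; // may be null, as in the incident
        Message(String deferralReason) { this.deferralReason = deferralReason; }
    }

    /** Null-safe lookup: a missing or blank reason becomes a placeholder instead of an NPE. */
    static String normaliseDeferralReason(String deferralReason) {
        return Optional.ofNullable(deferralReason)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .orElse("UNKNOWN");
    }

    /** Stand-in for the pre-fix handler: dereferences the reason without a null check. */
    static int naiveHandler(Message message) {
        return message.deferralReason.trim().length();
    }

    /**
     * Simulates a consumer draining a queue. When requeueOnFailure is true, a failed
     * message goes straight back to the head of the queue; otherwise it is moved to
     * deadLetters. maxDeliveries caps the simulation so the requeue case terminates.
     * Returns the number of delivery attempts made.
     */
    static int drain(Deque<Message> queue, Deque<Message> deadLetters,
                     boolean requeueOnFailure, int maxDeliveries) {
        int deliveries = 0;
        while (!queue.isEmpty() && deliveries < maxDeliveries) {
            Message message = queue.pollFirst();
            deliveries++;
            try {
                naiveHandler(message);
            } catch (NullPointerException e) {
                if (requeueOnFailure) {
                    queue.addFirst(message);     // the same message is redelivered next
                } else {
                    deadLetters.addLast(message); // rejected once, then set aside
                }
            }
        }
        return deliveries;
    }
}
```

With requeueOnFailure set to true, one message with a null deferral reason uses up every delivery attempt the cap allows and is still on the queue afterwards; with it set to false, the message is attempted once and set aside. If the service uses Spring AMQP (an assumption, though the log message above is consistent with it), the equivalent of the second option is typically to set defaultRequeueRejected to false on the listener container factory or to throw AmqpRejectAndDontRequeueException from the listener.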
...