Date |
Authors | Cai Willis, Doris.Wong, Jayanta Saha, Joseph (Pepe) Kelly, Adewale Adekoya
Status | Patched, Root Cause Found, Solution to Root Cause in Progress, Resolved
Summary |
Impact | Large volumes of logs were generated, requiring ~5 minutes of downtime for recommendations. The application had already been slowing down beforehand due to message processing.
...
Our error notification system (Sentry) reported an error at 11pm the previous night, which led us to investigate the logs this morning.
Further investigation showed that this error had occurred multiple times since 4pm the previous day.
The awslogs-prod-tis-revalidation-recommendation logs had grown to 1.2GB, and we were getting lots of errors of the form:
Execution of Rabbit message listener failed . . . Caused by: java.lang.NullPointerException
Investigation in the RabbitMQ console revealed a single endlessly requeuing message.
...
The final step is improving the handling of null deferral reasons and sub reasons (in progress). This was the root cause.
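As an illustration of what null-safe handling of deferral reasons and sub reasons could look like, here is a minimal sketch in plain Java. The class and field names (`DeferralInfo`, `DeferralFormatter`, `reason`, `subReason`) are hypothetical and are not taken from the recommendation service; the actual fix may differ.

```java
import java.util.Optional;

/** Hypothetical message payload; the field names are assumptions, not the service's real DTO. */
final class DeferralInfo {
    final String reason;    // may be null on unexpected messages
    final String subReason; // may be null even when reason is set

    DeferralInfo(String reason, String subReason) {
        this.reason = reason;
        this.subReason = subReason;
    }
}

final class DeferralFormatter {
    private DeferralFormatter() {}

    /** Builds a display string without dereferencing a null reason or sub reason. */
    static Optional<String> describe(DeferralInfo info) {
        if (info == null || isMissing(info.reason)) {
            return Optional.empty(); // nothing to describe; the caller decides what to do
        }
        if (isMissing(info.subReason)) {
            return Optional.of(info.reason); // a sub reason is optional
        }
        return Optional.of(info.reason + " (" + info.subReason + ")");
    }

    private static boolean isMissing(String value) {
        return value == null || value.trim().isEmpty();
    }
}
```

Returning an `Optional` makes the "no deferral reason" case explicit to the caller, so a message with missing fields can be processed or skipped deliberately rather than failing with a `NullPointerException` inside the listener.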
...
Timeline
9:03 - Cai Willis reported the errors
9:35 - Investigation started
9:40 - Issue reported to users and Recommendation paused for 5 minutes
9:50 - Temporary fix made
9:50 - Comms sent to users that Recommendation was back
10:40 - Preventative measure deployed to recommendation service (prevent requeuing)
12:05 - Likely root cause discovered
13:15 - Root cause solution deployed to production environment
...
Root Cause(s)
Some unexpected data was published to the queue reval.queue.recommendationstatuscheck.updated.recommendation
Poor handling of null values in deferral reasons
Default behaviour of requeuing messages when an exception is thrown
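The interaction between the last two causes can be sketched with a small in-memory simulation. This is plain Java, not the RabbitMQ or Spring AMQP API, and `RequeueSimulation`, `drain` and `deliveryCap` are hypothetical names introduced for illustration. With requeue-on-failure, a message that always throws is redelivered until something stops it. With rejection, it is set aside after one attempt. In Spring AMQP the second behaviour is typically obtained by setting `defaultRequeueRejected` to false on the listener container or by throwing `AmqpRejectAndDontRequeueException`; this report does not say which approach was deployed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** In-memory stand-in for a queue consumer; an illustration only. */
final class RequeueSimulation {
    private RequeueSimulation() {}

    /** Outcome of draining the queue. */
    static final class Result {
        final int deliveries;         // total delivery attempts made
        final List<String> rejected;  // messages set aside after a failure
        final boolean hitDeliveryCap; // true if only the safety cap ended the loop

        Result(int deliveries, List<String> rejected, boolean hitDeliveryCap) {
            this.deliveries = deliveries;
            this.rejected = rejected;
            this.hitDeliveryCap = hitDeliveryCap;
        }
    }

    static Result drain(List<String> messages, Consumer<String> listener,
                        boolean requeueOnFailure, int deliveryCap) {
        Deque<String> queue = new ArrayDeque<>(messages);
        List<String> rejected = new ArrayList<>();
        int deliveries = 0;
        while (!queue.isEmpty() && deliveries < deliveryCap) {
            String message = queue.poll();
            deliveries++;
            try {
                listener.accept(message);
            } catch (RuntimeException e) {
                if (requeueOnFailure) {
                    queue.addFirst(message); // redelivered at once, so it fails again
                } else {
                    rejected.add(message);   // kept for inspection instead of looping
                }
            }
        }
        // If messages remain, the loop stopped because of the cap, not because work finished.
        return new Result(deliveries, rejected, !queue.isEmpty());
    }
}
```

In the simulation, a listener that throws for one message never gets past it when `requeueOnFailure` is true, which mirrors the single endlessly requeuing message seen in the console. With `requeueOnFailure` set to false, the remaining messages are still processed and the failing one ends up in `rejected`, where a real system would usually route it to a dead-letter queue.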
...