Date

Authors

Cai Willis, Doris.Wong, Jayanta Saha, Joseph (Pepe) Kelly, Adewale Adekoya

Status

Resolved

Summary

Jira ticket: TIS21-2653

Impact

A large volume of logs was generated, requiring ~5 minutes of downtime for recommendations. The application had already been slowing down beforehand because of the message processing.

...

  • Our error notification system (Sentry) reported an error at 11pm the previous night, which led us to investigate the logs this morning.

    • Further investigation showed that this error had occurred multiple times since 4pm the previous day.

  • The awslogs-prod-tis-revalidation-recommendation logs had grown to 1.2GB and contained many "Execution of Rabbit message listener failed . . . Caused by: java.lang.NullPointerException" errors.

  • Investigation in the RabbitMQ console revealed a single message that was being requeued endlessly.

...

The final step is improving the handling of null deferral reasons and sub-reasons, as this was the root cause.

...

Timeline

  • 9:03 - Cai Willis reported the errors

  • 9:35 - Investigation started

  • 9:40 - Issue reported to users and recommendation paused for 5 minutes

  • 9:50 - Temporary fix made

  • 9:50 - Comms sent to users that recommendation was back

  • 10:40 - Preventative measure deployed to recommendation service (prevent requeuing)

  • 12:05 - Likely root cause discovered

  • 13:15 - Root cause solution deployed to production environment

...

Root Cause(s)

  • Some unexpected data was placed on the reval.queue.recommendationstatuscheck.updated.recommendation queue

  • Poor handling of null values in deferral reasons

  • Default behaviour of requeuing messages when an exception is thrown
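
The interaction between the last two root causes can be illustrated with a minimal sketch. It is not the service's actual code: the class, the field names (deferralReason, deferralSubReason), the placeholder values and the Outcome enum are all assumptions made for illustration, and it uses only the Java standard library rather than a real RabbitMQ client. It shows null deferral reasons being defaulted instead of dereferenced, and a failing message being rejected without requeue so that it cannot loop.

```java
import java.util.Optional;

// Illustrative sketch only; every name here is hypothetical.
final class DeferralHandlingSketch {

    // Hypothetical message payload; either field may arrive as null.
    static final class RecommendationUpdate {
        final String deferralReason;
        final String deferralSubReason;

        RecommendationUpdate(String deferralReason, String deferralSubReason) {
            this.deferralReason = deferralReason;
            this.deferralSubReason = deferralSubReason;
        }
    }

    // What the listener tells the broker to do with the message.
    enum Outcome { ACK, REJECT_NO_REQUEUE }

    // Null-safe: missing reasons fall back to placeholder values
    // instead of causing a NullPointerException.
    static String describeDeferral(RecommendationUpdate update) {
        String reason = Optional.ofNullable(update.deferralReason).orElse("UNKNOWN");
        String subReason = Optional.ofNullable(update.deferralSubReason).orElse("NONE");
        return reason + "/" + subReason;
    }

    // Any failure results in a reject without requeue, so one bad
    // message cannot be redelivered endlessly.
    static Outcome handle(RecommendationUpdate update) {
        try {
            if (update == null) {
                throw new IllegalArgumentException("empty message");
            }
            describeDeferral(update);
            return Outcome.ACK;
        } catch (RuntimeException e) {
            return Outcome.REJECT_NO_REQUEUE;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeDeferral(new RecommendationUpdate(null, null)));
        System.out.println(handle(null));
    }
}
```

If the service uses Spring AMQP (which the log message suggests but this page does not confirm), a comparable effect is usually obtained by throwing AmqpRejectAndDontRequeueException from the listener or by setting defaultRequeueRejected to false on the listener container factory, ideally with a dead-letter queue so that rejected messages can still be inspected.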

...