Date |
Authors | Cai Willis, Doris Wong, Jayanta Saha, Joseph (Pepe) Kelly, Adewale Adekoya
Status | Patched, Root Cause Found, Solution to Root Cause in Progress
Summary |
Impact | Large numbers of logs generated; ~5 minutes of downtime for recommendations was required. The application was becoming slow beforehand due to message processing.
Non-technical Description
A large number of errors were being generated during the hourly recommendation status check (which is used to synchronise recommendation status in GMC with the revalidation app). The errors were caused by bad data returned by GMC for a single doctor (GMC No: 7064627). These errors also appeared to be creating snapshots, raising the concern that MongoDB might be filling up.
...
had been returned with a deferral reason but no sub reason - which our application failed to handle gracefully.
The logs indicated that a large number of snapshots might have been generated, which could have been adding a large number of unnecessary records to the database, so quick action was taken that required a reboot of the recommendations app. Thankfully this turned out to be a red herring.
Since then, some further changes have been made to the application to improve resilience and prevent this sort of issue in future:
We have prevented erroneous updates from endlessly retrying
We have improved logging and error handling
We have improved the handling of null deferral reasons and sub reasons
...
Trigger
By default, any message that generates an error during processing is requeued; if this is not handled properly, the message will be requeued endlessly, generating error logs each time.
In addition, poor handling of null values in deferral sub reasons caused this error in the first place.
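For context, these messages are consumed by a Spring AMQP Rabbit listener (the "Execution of Rabbit message listener failed" error quoted under Detection comes from it), and Spring AMQP's default behaviour when a listener throws is to reject and requeue the message, so a message that can never be processed loops forever. Below is a minimal sketch of the listener-side fix, using a hypothetical queue name and class rather than the actual service code:

```java
// Sketch: reject an unprocessable ("poison") message instead of letting it requeue forever.
// The queue name, class and payload type are illustrative assumptions.
import org.springframework.amqp.AmqpRejectAndDontRequeueException;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class RecommendationStatusListener {

    @RabbitListener(queues = "reval.queue.recommendation.status")
    public void receive(String message) {
        try {
            process(message);
        } catch (Exception e) {
            // Default behaviour on an exception is reject-and-requeue, which loops
            // endlessly for a message that can never succeed. Wrapping the failure in
            // AmqpRejectAndDontRequeueException makes the broker drop (or dead-letter) it.
            throw new AmqpRejectAndDontRequeueException("Unprocessable recommendation status message", e);
        }
    }

    private void process(String message) {
        // Actual status-check handling lives in the service.
    }
}
```

An alternative is to set defaultRequeueRejected to false on the listener container factory so that failed messages are never requeued; either approach is best paired with a dead-letter queue if the bad messages still need to be inspected later.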
Detection
Our error notification system (Sentry) led us to investigate logs this morning.
The awslogs-prod-tis-revalidation-recommendation logs were up to 1.2GB, and it looked like we were getting lots of
Execution of Rabbit message listener failed . . . Caused by: java.lang.NullPointerException
errors on the recommendation status check that morning (though it was not clear whether this accounted for the 1.2GB). The application itself still appeared to be working.
Resolution
Investigation of the RabbitMQ console revealed a single endlessly requeuing message.
...
The initial quick patch was to purge this message from the queue (this required temporarily stopping the recommendation service in order to catch the message to be purged).
The second step was an update to the recommendations app to prevent requeuing of error-generating messages; this was done quickly to prevent the next hourly job from starting the same problem all over again.
The third step was improving the logging and messages around the messages received by the recommendation service.
The final step is improving the handling of null deferral reasons and sub reasons (in progress; a sketch of these last two changes follows below).
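The last two steps are broadly of this shape. This is a minimal sketch only, assuming a hypothetical payload: the DeferralHandler and RecommendationStatusMessage names and their fields are illustrative, not the actual service code.

```java
// Sketch: null-tolerant deferral handling plus contextual logging for received messages.
// Class, method and field names are illustrative assumptions.
import java.util.Optional;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DeferralHandler {

    private static final Logger LOG = LoggerFactory.getLogger(DeferralHandler.class);

    /** Illustrative shape of the status-check payload returned by GMC. */
    public record RecommendationStatusMessage(String gmcNumber, String deferralReason, String deferralSubReason) { }

    public void handle(RecommendationStatusMessage message) {
        LOG.debug("Received recommendation status update for gmcNumber={}", message.gmcNumber());

        // A deferral with no sub reason is valid data; default it explicitly rather than
        // letting a NullPointerException escape the listener (and the message requeue).
        String reason = Optional.ofNullable(message.deferralReason()).orElse("UNKNOWN");
        String subReason = Optional.ofNullable(message.deferralSubReason()).orElse("NONE");

        try {
            updateRecommendation(message.gmcNumber(), reason, subReason);
        } catch (Exception e) {
            // Log enough context to identify the offending record (e.g. a single GMC number)
            // without having to dig through the raw queue.
            LOG.error("Failed to process status update for gmcNumber={}, reason={}, subReason={}",
                    message.gmcNumber(), reason, subReason, e);
            throw e;
        }
    }

    private void updateRecommendation(String gmcNumber, String reason, String subReason) {
        // The actual update of the recommendation record happens in the service.
    }
}
```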
...
Timeline
9:03 - Cai Willis reported the errors
9:35 - Investigation started
9:40 - Issue reported to users and the recommendation service paused for 5 minutes
9:50 - Temporary fix made
9:50 - Comms sent to users that recommendations were back
10:40 - Preventative measure deployed to recommendation service (prevent requeuing)
12:05 - Likely root cause discovered
...
Root Cause(s)
Poor handling of null values in deferral reasons
Default behaviour of requeuing messages when an exception is thrown
...
Action Items
Action Items | Owner
---|---
 | Cai Willis
...
Lessons Learned
Increased monitoring (e.g. Sentry Slack integrations) enabled us to solve this issue very quickly
Care needs to be taken when using message brokers to handle exception-generating messages appropriately
There are still some areas of the Reval Recommendation service that need improvement in terms of robustness and best practices