2024-02-03 Weekend recommendation changes wouldn't have been visible

Date

Feb 5, 2024

Authors

@Joseph (Pepe) Kelly

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-5681 We received a large number of Sentry errors, which showed that recommendation changes made over the weekend would not have been reflected in the reval search.

Impact

No user impact determined.

Non-technical Description

We have a job, running up to once per minute, that follows changes in the revalidation database. It keeps track of the most recent change it processed so that it knows where to pick up next time. We received an alert that the reference to the latest recommendation change could not be used because it was no longer available. Fortunately, changes to doctors continued to be processed.

We moved the old reference and made a change to the code to enable the job to restart tracking changes.

Between Saturday morning and Monday morning, changes in the Reval database would not have been reflected in the Reval UI. However, as there were no recommendation changes during this period, no user impact was determined.

Trigger

  • There were about 2,800 Sentry errors. It doesn’t look like any recommendations were made during this period, so we were fortunate that nothing was affected.

  • Combination of the 3 hour window having elapsed and the transaction log rolling over


Detection

  •  


Resolution

  •  


Timeline

All times in GMT unless indicated

  • Feb 2, 2024 18:07 - Last recommendation submitted.

  • Feb 3, 2024 05:00-05:29 - Backup window

  • Feb 3, 2024 05:29 - Earliest Slack message identifying an issue with the production CDC Lambda

  • Feb 5, 2024 ~06:00 - Renamed the attribute in the database collection. This caused other issues, which were resolved by applying a hotfix to generate a new reference until a more robust solution is implemented.

Root Cause(s)

  • Sentry errors were caused by a reference which could not be found

  • The reference was to the change stream for the last change in the recommendation collection, which was presumably cleaned up

  • The change stream was cleaned up*

  • The change stream is configured to hold references for at least 3 hours (default)
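
As a rough illustration of the root cause above, the gap between the last recommendation change and the first reported error can be compared against the 3 hour default retention. This is only a sketch: the timestamps are taken from the timeline, the function name is hypothetical, and exceeding the retention period only means the reference was eligible for cleanup, not that it had definitely been removed.

```python
from datetime import datetime, timedelta

# Default change stream retention, as stated in the root causes.
RETENTION = timedelta(hours=3)

# Timestamps from the timeline (GMT).
last_change = datetime(2024, 2, 2, 18, 7)   # last recommendation submitted
first_error = datetime(2024, 2, 3, 5, 29)   # earliest Slack message about the CDC Lambda


def reference_eligible_for_cleanup(last_change, now, retention=RETENTION):
    """Hypothetical helper: True if the stored reference is older than the retention period."""
    return now - last_change > retention


gap = first_error - last_change
print(f"Gap since last change: {gap}")
print(f"Eligible for cleanup: {reference_eligible_for_cleanup(last_change, first_error)}")
```

The gap here is well over 3 hours, which is consistent with the reference having been cleaned up once the log rolled over.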

 

 


Action Items

Action Items

Owner

Comments

Investigate why the position in the capped collection was deleted (see errors at the weekend):

  • When were the last changes to recommendation records before Sat 3rd Feb?

    The previous change was on Friday evening

  • How is the Document DB change stream configured? Is it a 7 day window?
    The change stream retention period is the default 3 hours, but entries are only removed once the log is full at the AWS-configured 51,200MB. It is possible that this incident represents the first time this threshold was crossed without additional writes since the cluster was last modified. Increasing the retention period will impact storage and performance.

@Joseph (Pepe) Kelly

Done

 

 

 

Increase the change stream retention period.

https://hee-tis.atlassian.net/browse/TIS21-5713

 

Make the resume token check work if there isn’t one…

  1. Upgrade and

    1. Use a time based check

    2. Change the Lambda to use Change Stream events directly

    3. Look for a way to directly integrate the change stream with SQS or another asynchronous messaging service

  2. Check & modify DocumentDB resume token to “upsert” a new value.

Not right now. We can review this if the database is upgraded as we expect to do soon.
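
One possible shape for making the resume token check tolerate a missing or expired token is sketched below. It is not the hotfix that was applied. All names (`ResumeTokenLost`, `open_change_stream` and the three callables) are hypothetical, and the callables stand in for whatever the Lambda uses to read the stored token and open the stream, for example a driver's `watch` call with a resume token.

```python
from typing import Any, Callable, Optional, Tuple


class ResumeTokenLost(Exception):
    """Hypothetical error: the stored token no longer exists in the change stream."""


def open_change_stream(
    load_token: Callable[[], Optional[Any]],
    open_from_token: Callable[[Any], Any],
    open_from_now: Callable[[], Any],
) -> Tuple[Any, str]:
    """Resume from the stored token if possible, otherwise start a fresh stream.

    Returns the stream and a label saying how it was opened, so the caller
    can log or alert when changes may have been skipped.
    """
    token = load_token()
    if token is None:
        # No token stored yet: start tracking from the current position.
        return open_from_now(), "fresh"
    try:
        return open_from_token(token), "resumed"
    except ResumeTokenLost:
        # The token was cleaned up: restart instead of failing on every run.
        return open_from_now(), "fresh"
```

Returning the label alongside the stream is a deliberate choice in this sketch: a "fresh" start after a lost token means some changes may not have been processed, which is worth alerting on once rather than raising an error on every run.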

 


Lessons Learned

  •