Date

Authors

Joseph (Pepe) Kelly

Status

Documenting

Summary

We received loads of Sentry errors.

Impact


Non-technical Description

We have a Lambda job running up to once per minute that follows changes in the revalidation database. It keeps track of the most recent change it processed in order to know where it should pick up next time. We received an alert that the reference to that latest change couldn’t be used: the reference was no longer available.

We moved the old reference and made a change to the code to enable the Lambda job to restart tracking changes.

Between Saturday morning and Monday morning, some changes in the Reval Database would not have been reflected on Reval UI; however, there were no changes during this period.

Trigger

  • There were a lot of Sentry errors (about 2,800). It is unclear whether anything was missed.

...

Detection

...

Resolution

...

Timeline

All times in GMT unless indicated

  • 18:07 - Last recommendation submitted

  • 05:00-05:29 - Backup window

  • 05:29 - Earliest Slack message identifying an issue with the production CDC Lambda

  • ~06:00 - Renamed the attribute in the database collection. This caused other issues, which were resolved by applying a hotfix to generate a new reference.

  • -

  • -

Root Cause(s)

...

Action Items

Action Items

Owner

Investigating why the position in the capped collection was deleted (see errors on the weekend):

  • When were the last changes to recommendation records before Sat 3rd Feb?

    The previous change was on Friday evening

  • How is the Document DB change stream configured? Is it a 7 day window?
    The change stream log retention period is the default 3 hours, but entries older than that are only removed once the log is full, at an AWS-configured 51,200 MB of data. It is possible that this was the first time the threshold was crossed without additional writes since the cluster was last modified. Increasing the retention period will impact storage and performance.

Joseph (Pepe) Kelly

Increase the change stream retention period.
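
As a rough illustration of this action item, the sketch below builds the cluster parameter change for a longer retention period. It assumes the retention is controlled by the DocumentDB cluster parameter `change_stream_log_retention_duration` (in seconds, default 10,800, with an allowed range of 1 hour to 7 days); the function names, the parameter group name and the `immediate` apply method are assumptions for illustration, not our actual configuration.

```python
from typing import Dict, List

# Assumed bounds for the retention parameter, in seconds.
MIN_RETENTION_SECONDS = 3600     # 1 hour
MAX_RETENTION_SECONDS = 604800   # 7 days


def retention_parameter(hours: int) -> List[Dict[str, str]]:
    """Build the parameter list for a retention period given in hours."""
    seconds = hours * 3600
    if not MIN_RETENTION_SECONDS <= seconds <= MAX_RETENTION_SECONDS:
        raise ValueError(f"retention of {hours}h is outside the 1h-7d range")
    return [
        {
            "ParameterName": "change_stream_log_retention_duration",
            "ParameterValue": str(seconds),
            "ApplyMethod": "immediate",  # assumption: applied without a reboot
        }
    ]


def apply_retention(parameter_group: str, hours: int) -> None:
    """Apply the change to a custom cluster parameter group (hypothetical name)."""
    import boto3  # imported lazily so retention_parameter has no AWS dependency

    boto3.client("docdb").modify_db_cluster_parameter_group(
        DBClusterParameterGroupName=parameter_group,
        Parameters=retention_parameter(hours),
    )
```

For example, `retention_parameter(72)` would request enough retention to cover a quiet weekend, at the cost of the extra storage noted above.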

Make the resume token check work if there isn’t one…

  1. Upgrade and

    1. Use a time based check

    2. Change the Lambda to use Change Stream events directly

    3. look for how to directly integrate the change stream with SQS or other asynchronous messaging service

  2. Check & modify DocumentDB resume token to “upsert” a new value.
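
The resume token options above could look roughly like the sketch below. It is not the actual Lambda code: `tokens` stands in for the collection that stores the checkpoint, `open_stream` for the driver call that opens the change stream from a given token, and `ResumeTokenUnavailable` for whatever error the driver raises when the token has aged out of the change stream log. All three names are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Optional


class ResumeTokenUnavailable(Exception):
    """Hypothetical error: the stored token is no longer in the change stream log."""


def process_changes(
    tokens: Dict[str, Any],
    open_stream: Callable[[Optional[Any]], Iterable[dict]],
    handle: Callable[[dict], None],
    key: str = "cdc-resume-token",
) -> int:
    """Process pending change events, checkpointing after each one."""
    token = tokens.get(key)  # None on the first run or if the token was cleared
    try:
        events = list(open_stream(token))
    except ResumeTokenUnavailable:
        # Stale token: restart from "now" instead of failing on every invocation.
        # Changes made between the stale token and now still need a separate check.
        events = list(open_stream(None))
    processed = 0
    for event in events:
        handle(event)
        # "Upsert" the checkpoint: insert it if missing, overwrite it otherwise.
        tokens[key] = event["_id"]
        processed += 1
    return processed
```

With a real driver, the checkpoint write would be an update with upsert enabled rather than a dictionary assignment, so a missing or renamed token document no longer blocks the job. A time-based check could replace the `None` fallback by starting from the timestamp of the last processed change, if the database version supports it.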

...

Lessons Learned