Date

Authors

Cai Willis

Status

Documenting

Summary

Revalidation was intermittently unavailable or unusably slow.

Impact

No one was able to reliably view or do anything in the revalidation website.

Non-technical Description

A bulk background change to data caused the system to be too busy to process normal activity.


Trigger

  • A bulk data cleanup script (mongock) proved too intensive and blocked the application's connection to the database

Detection

  • Visual inspection of the site after the change had been deployed

  • Notably, none of the services logged errors at any point, even though no data was being returned to the front end (the services were in fact reporting 200s)


Resolution

  • The database connection eventually restored itself without intervention


Timeline

  • ??:?? - Possible earlier occurrence (Pepe thinks this behaviour may have been observed on staging). Staging had been failing in a similar way before the data cleanup script was pushed, and restarting the recommendation service appeared to resolve it. Because staging was already behaving oddly, the impact of the script was probably masked there. Staging data also differs from prod data, so the problem might not have appeared on staging anyway.

  • 13:00 - Snapshot of prod db created.

  • 13:07 - Started push to prod.

  • 13:29 - Push to prod complete (longer than usual; this was attributed to mongock running).

  • 13:30 - Issue identified by inspection of application in browser

  • 13:47 - Users informed

  • 13:49 - Application working intermittently

  • 14:00 - Application appeared to have stabilised

  • 14:16 - Users informed of resolution


Root Cause(s)

  • Intensive mongock changeset appeared to block the DocumentDB service

  • Possible hole in our monitoring: nothing (e.g. logging) indicated that the application could not connect to the database

  • Possibly, running on a single database instance is not sustainable?
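
To illustrate the monitoring gap above, here is a minimal sketch of a database connectivity probe that logs an error when the database is slow or unreachable, instead of silently returning empty results. It is not taken from our services: `DatabaseProbe`, `isHealthy` and the `ping` callback are hypothetical names, and `ping` stands in for whatever cheap database command the real service would run.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical probe: reports and logs whether the database answers within a timeout. */
final class DatabaseProbe {
    private static final Logger LOG = Logger.getLogger(DatabaseProbe.class.getName());

    /**
     * Runs ping on a separate daemon thread and waits at most timeoutMillis for it.
     * Returns true only if ping returned true in time; every other outcome is logged
     * at SEVERE level and reported as unhealthy.
     */
    static boolean isHealthy(Callable<Boolean> ping, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "db-probe");
            thread.setDaemon(true); // never keep the JVM alive because of a stuck ping
            return thread;
        });
        Future<Boolean> result = executor.submit(ping);
        try {
            boolean ok = Boolean.TRUE.equals(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
            if (!ok) {
                LOG.severe("Database ping returned a failure result");
            }
            return ok;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck ping
            LOG.severe("Database ping timed out after " + timeoutMillis + " ms");
            return false;
        } catch (ExecutionException e) {
            LOG.log(Level.SEVERE, "Database ping failed", e.getCause());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            LOG.severe("Database probe was interrupted");
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A service could call a probe like this from its health endpoint or on a schedule, so that a blocked database shows up in logs and alerts even while requests still return 200s.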


Action Items

Action Items

Owner


Lessons Learned

  • Where possible, perform data operations outside of core hours, and warn users about possible disruption

  • Break up mongock changesets into smaller operations where possible
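
As a rough illustration of breaking a bulk change into smaller operations, the sketch below splits a list of record ids into batches and pauses between them so that normal application traffic can reach the database. It is plain Java rather than mongock code: `BatchedCleanup`, `runInBatches` and the `applyBatch` callback are hypothetical names, and the batch size and pause would need tuning against the real data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: applies a bulk change in small batches with a pause in between. */
final class BatchedCleanup {

    /**
     * Splits ids into batches of at most batchSize, passes each batch to applyBatch,
     * and sleeps pauseMillis between batches. Returns the number of batches processed.
     */
    static <T> int runInBatches(List<T> ids, int batchSize, long pauseMillis,
                                Consumer<List<T>> applyBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the slice so the callback owns an independent list.
            applyBatch.accept(new ArrayList<>(ids.subList(start, end)));
            batches++;
            // Pause only between batches, not after the last one.
            if (end < ids.size() && pauseMillis > 0) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Cleanup interrupted after " + batches + " batches", e);
                }
            }
        }
        return batches;
    }
}
```

In a real changeset, `applyBatch` would issue one bounded update per batch (for example, an update filtered to the ids in that batch), which keeps each database operation short.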
