...
??:?? - Did anything happen earlier? (Pepe thinks this behaviour may have been observed on staging.) Staging had been failing in a similar way before the data cleanup script was pushed, and restarting the recommendation service appeared to resolve everything. The fact that staging was already behaving oddly probably masked the impact of the script (although, as staging data is different to prod data, it's possible that the issue would not have appeared on staging anyway).
13:00 - Snapshot of prod db created.
13:07 - Started push to prod.
13:29 - Push to prod complete (longer than usual - attributed this to mongock running).
13:30 - Issue identified by inspection of application in browser
13:47 - Users informed
13:49 - Application working intermittently
14:00 - Application appeared to have stabilised
14:16 - Users informed of resolution
...
Root Cause(s)
An intensive mongock changeset appeared to block the documentDb service
Possible hole in our monitoring: there was no indication (e.g. from logging) that the application couldn’t connect to the database
Use of a single instance is possibly not sustainable?
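
As an illustration of the monitoring gap above, here is a minimal sketch of a connectivity check that logs each failed attempt. It is written in Python for brevity, not in the application's own stack. `check_database` and the `ping` callable are hypothetical names: `ping` stands in for whatever cheap call the real database driver offers, and it is assumed to raise an exception when the database is unreachable.

```python
import logging
from typing import Callable

logger = logging.getLogger("db_health")


def check_database(ping: Callable[[], None], attempts: int = 3) -> bool:
    """Return True if `ping` succeeds within `attempts` tries.

    `ping` is a hypothetical stand-in for a cheap driver call that is
    assumed to raise an exception when the database cannot be reached.
    """
    for attempt in range(1, attempts + 1):
        try:
            ping()
            return True
        except Exception as exc:
            # Log every failure so an outage leaves a visible trace.
            logger.error(
                "Database ping failed (attempt %d of %d): %s",
                attempt, attempts, exc,
            )
    return False
```

Run on a schedule or exposed through a health endpoint, a check along these lines would at least leave a log entry to alert on when the application cannot reach the database.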
...
Action Items
Action Items | Owner |
---|---|
...
Lessons Learned
Where possible, perform data operations outside of core hours, and warn users in advance about possible disruption
Break up mongock changesets into smaller operations where possible
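
The idea of splitting one large operation into smaller ones can be sketched as below. This is a language-neutral illustration in Python and does not use mongock's API. `in_batches`, `run_in_batches`, `apply_batch`, and the default batch size are all hypothetical; a real changeset would apply each batch through the database driver.

```python
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def run_in_batches(
    items: Iterable[T],
    apply_batch: Callable[[List[T]], None],
    batch_size: int = 500,       # assumed value; tune against the real workload
    pause_seconds: float = 0.0,  # optional gap so other traffic can get through
) -> int:
    """Apply `apply_batch` to each batch and return the number of items processed."""
    processed = 0
    for batch in in_batches(items, batch_size):
        apply_batch(batch)
        processed += len(batch)
        if pause_seconds:
            time.sleep(pause_seconds)
    return processed
```

Smaller batches with a short pause between them trade total run time for less sustained load on the database, which may help avoid blocking the service as happened in this incident.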