Date

Authors

Cai Willis, Joseph (Pepe) Kelly

Status

Documenting

Summary

Revalidation was intermittently not working or unusably slow

Jira: TIS21-2804

Impact

No one was able to reliably see or do anything in the revalidation website.

Non-technical Description

A bulk background change to data appears to have caused the system to be too busy to process normal activity.

It's possible there was some overlap with an attack being experienced by the GMC, which has muddied the waters a bit.

...

Trigger

  • A bulk data cleanup script (a Mongock changeset) proved too intensive and blocked connections to the database from the application

Detection

  • Visual inspection of the site after the change had been deployed

Resolution

  • It is interesting to note that at no point were errors logged by any of the services, despite no data being returned to the front end; in fact, the services were reporting 200s (see the client-timeout sketch below)
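
One way to make this failure mode more visible would be explicit timeouts on the database client, so that a blocked or unreachable database surfaces as an exception (and therefore a logged error and a 5xx) rather than a silently empty 200. A minimal sketch, assuming the services use the MongoDB Java sync driver; the factory class and timeout values are illustrative only:

```java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

import java.util.concurrent.TimeUnit;

public class MongoClientConfig {

  public static MongoClient create(String uri) {
    MongoClientSettings settings = MongoClientSettings.builder()
        .applyConnectionString(new ConnectionString(uri))
        // Fail fast if no server can be selected (the driver default is 30s).
        .applyToClusterSettings(b -> b.serverSelectionTimeout(5, TimeUnit.SECONDS))
        // Don't let individual reads hang indefinitely on a blocked connection.
        .applyToSocketSettings(b -> b.connectTimeout(5, TimeUnit.SECONDS)
            .readTimeout(10, TimeUnit.SECONDS))
        .build();
    // Queries made while the database is unreachable now fail with a timeout
    // exception, which can be logged and mapped to a 5xx instead of an empty 200.
    return MongoClients.create(settings);
  }
}
```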

...

Resolution

  • Database connection eventually restored itself with no intervention

...

Timeline

  • ??:?? - Similar behaviour may have been observed earlier on staging (Pepe thinks so): staging had been failing in a similar way prior to pushing the data cleanup script, and restarting the recommendation service appeared to resolve everything. The fact that staging was already behaving oddly probably clouded the impact of the script (but as staging data is different to prod data, it's possible this would not have happened there anyway)

  • 13:00 - Snapshot of prod db created.

  • 13:07 - Started push to prod.

  • 13:29 - Push to prod complete (took longer than usual - attributed to Mongock running).

  • 13:30 - Issue identified by inspection of application in browser

  • 13:47 - Users informed

  • 13:49 - Application working intermittently

  • 14:00 - Application appeared to have stabilised

  • 14:16 - Users informed of resolution

...

Root Cause(s)

  • An intensive Mongock changeset appeared to block the DocumentDB service

  • Possible gap in our monitoring, meaning there was no indication (e.g. from logging) when the application couldn't connect to the database - see the health-check sketch after this list

  • Possible that the use of a single instance is not sustainable?
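
One option for closing the monitoring gap would be an explicit database health check exposed to monitoring and alerting. A minimal sketch, assuming the services are Spring Boot applications with Actuator available (Spring Boot also ships a built-in MongoDB health indicator, which may only need to be exposed and alerted on); the class and database names are illustrative:

```java
import com.mongodb.client.MongoClient;
import org.bson.Document;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class DatabasePingHealthIndicator implements HealthIndicator {

  private final MongoClient mongoClient;

  public DatabasePingHealthIndicator(MongoClient mongoClient) {
    this.mongoClient = mongoClient;
  }

  @Override
  public Health health() {
    try {
      // Round trip to the database; this throws if connections are blocked or
      // no server can be selected within the driver's timeout.
      Document result = mongoClient.getDatabase("admin").runCommand(new Document("ping", 1));
      return Health.up().withDetail("ping", result.toJson()).build();
    } catch (Exception e) {
      // Reporting DOWN gives monitoring/alerting something concrete to act on,
      // even while the HTTP layer is still returning 200s.
      return Health.down(e).build();
    }
  }
}
```

Alerting on this (or on the built-in /actuator/health endpoint) would have flagged the blocked database connection even though the services were still returning 200s to the front end.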

...

Action Items

Action Items | Owner

...

Lessons Learned

  • Maybe perform data operations outside of core hours where possible, and warn users about possible disruption

  • Break up Mongock changesets into smaller operations where possible (see the changeset sketch below)
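
For example, a changeset could stream the affected documents and apply updates in small bulk batches rather than issuing one collection-wide update, giving the database idle time between round trips. A minimal sketch, assuming Mongock 5.x-style @ChangeUnit annotations and the MongoDB sync driver; the collection name, filter, and batch size are illustrative only:

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.WriteModel;
import io.mongock.api.annotations.ChangeUnit;
import io.mongock.api.annotations.Execution;
import io.mongock.api.annotations.RollbackExecution;
import org.bson.Document;
import org.bson.types.ObjectId;

import java.util.ArrayList;
import java.util.List;

@ChangeUnit(id = "example-batched-cleanup", order = "001", author = "tis")
public class BatchedCleanupChangeUnit {

  private static final int BATCH_SIZE = 500;

  @Execution
  public void execute(MongoDatabase db) {
    MongoCollection<Document> coll = db.getCollection("recommendations");
    List<WriteModel<Document>> batch = new ArrayList<>();

    // Stream only the documents that still need cleaning and write them in
    // small bulk batches, rather than one large collection-wide operation.
    for (Document doc : coll.find(Filters.exists("legacyField"))) {
      ObjectId id = doc.getObjectId("_id");
      batch.add(new UpdateOneModel<>(
          Filters.eq("_id", id),
          new Document("$unset", new Document("legacyField", ""))));

      if (batch.size() >= BATCH_SIZE) {
        coll.bulkWrite(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      coll.bulkWrite(batch);
    }
  }

  @RollbackExecution
  public void rollback(MongoDatabase db) {
    // The removed field cannot be reconstructed here; rely on the pre-change
    // snapshot (as taken at 13:00 in the timeline) for rollback.
  }
}
```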