Date	08 Sep 2021
Authors	Joseph (Pepe) Kelly
Status	Documenting
Summary	https://hee-tis.atlassian.net/browse/TIS21-2098
Impact	No end user impact. The database that holds the information for communicating with ESR was less resilient for a couple of hours and the integration was paused for several hours.

Non-technical Description

The database keeps 3 copies of its data for resilience against failures. An attempt to create a backup used all the disk space available to one of these copies, one “node” in the MongoDB replica set, which caused the node to fail. The state of synchronisation between it and the other nodes meant that it could not recover automatically when the disk space was released again.

The data for the node was removed to allow a “full sync/initialisation” to take place. This took several hours and the integration services were switched off during this period to prevent cascading issues. When the services resumed, the pending events from TIS, e.g. updates to personal details, were processed.

Trigger

Disk space full

Detection

Detected by the TIS team, who were monitoring the backup process.

Resolution

Reinitialising the data directory for the failed node and allowing a full synchronisation to take place.
Restarting the machine the database was on once the synchronisation had completed.

Timeline

08 Sep 2021: BST - Backup run
08 Sep 2021: BST -
08 Sep 2021: BST -

Root Cause(s)

Backup written to the same disk that held the node's data, which used all the available disk space.

Action Items

Action Items	Owner

Lessons Learned

Never write backup output to the same device that the database's data is written on.
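One way to enforce this is a pre-flight check that runs before a backup is started. The following is a minimal sketch in Python, not the check used by the TIS team: the function names `same_device` and `backup_is_safe`, and the idea of passing an estimated backup size, are assumptions made for illustration. It compares the device IDs of the data directory and the backup destination, then confirms the destination has enough free space.

```python
import os
import shutil


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same filesystem device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def backup_is_safe(data_dir: str, backup_dir: str, required_bytes: int) -> bool:
    """Hypothetical pre-flight check to run before starting a backup.

    Returns False if the backup would be written to the device that
    holds the database's data, or if the destination does not have
    room for the estimated backup size (required_bytes).
    """
    if same_device(data_dir, backup_dir):
        return False
    free_bytes = shutil.disk_usage(backup_dir).free
    return free_bytes >= required_bytes
```

A backup script could call `backup_is_safe` with the node's data directory, the intended output directory and a size estimate, and abort when it returns False. The device comparison is deliberately strict: it rejects any destination on the same device, even when plenty of space is free.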
