2021-09-08 ESR Integration database less resilient

Date

Sep 8, 2021

Authors

@Joseph (Pepe) Kelly

Status

Documenting

Summary

https://hee-tis.atlassian.net/browse/TIS21-2098

Impact

No end user impact. The database that holds the information for communicating with ESR was less resilient for a couple of hours and the integration was paused for several hours.

Non-technical Description

The database keeps 3 copies for resilience against failures. An attempt to create a backup used all the disk space available to one of these copies, one “node” in the MongoDB replica set, and caused that node to fail. Because of the state of synchronisation between it and the other nodes, the node could not recover automatically once the disk space was released.

The data for the node was removed to allow a “full sync/initialisation” to take place. This took several hours and the integration services were switched off during this period to prevent cascading issues. When the services resumed, the pending events from TIS, e.g. updates to personal details, were processed.


Trigger

  • Disk space full


Detection

  • Detected by the TIS team, who were monitoring the backup process.

     


Resolution

  • Reinitialising the data directory for the failed node and allowing a full synchronisation to take place.

  • Restarting the machine that hosts the database once the synchronisation had completed.
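The replica set health check that followed the restart could be scripted along these lines. This is a minimal sketch, not the procedure the team used: it assumes a status document shaped like the output of MongoDB's `replSetGetStatus` command (a `members` list whose entries carry `name`, `health` and `stateStr`), and the function name `unhealthy_members` is hypothetical.

```python
from typing import Any, Dict, List

# States in which a member is considered settled. A node that is still
# performing its initial sync reports STARTUP2 and is treated as unhealthy.
HEALTHY_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}


def unhealthy_members(status: Dict[str, Any]) -> List[str]:
    """Return the names of members that are down or not in a settled state.

    `status` is assumed to look like the document returned by the
    `replSetGetStatus` admin command.
    """
    bad = []
    for member in status.get("members", []):
        up = member.get("health") == 1
        settled = member.get("stateStr") in HEALTHY_STATES
        if not (up and settled):
            bad.append(member.get("name", "<unknown>"))
    return bad
```

With pymongo, the status document could be fetched with `MongoClient(uri).admin.command("replSetGetStatus")`. An empty result would indicate that every node is reachable and settled, which is the condition to wait for before restarting the integration services.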


Timeline

  • Sep 8, 2021 16:30 BST - Backup started

  • Sep 8, 2021 16:44 BST - Disk full

  • Sep 8, 2021 17:46 BST - After investigating several options, began a full re-sync of the node

  • Sep 8, 2021 19:19 BST - Node transitioned to secondary but the VM was still non-responsive; the EC2 status check started failing

  • Sep 8, 2021 21:44 - 22:10 BST - Restarted the machine and confirmed that the replica set reported as healthy

  • Sep 9, 2021 00:04 BST - All integration services restarted and a problematic message cleared from the message broker

Root Cause(s)

  • Backup written to the same device used for data & transaction logs

 

 


Action Items

  • Increase storage available to Mongo (owner not recorded)

  • Create a repeatable process for copying data to stage environment (owner not recorded)

Lessons Learned

  • Never write backup output to the same device that holds the database's data and transaction logs.
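One way to enforce this lesson is a pre-flight check that runs before any backup starts. The sketch below is illustrative only and was not part of the incident response: the function names, the `estimated_bytes` input and the 20% headroom default are all assumptions. It compares filesystem device IDs, so it detects a shared mount but cannot tell whether two separate mounts sit on the same physical disk.

```python
import os
import shutil
from typing import List


def same_device(path_a: str, path_b: str) -> bool:
    """Return True when both paths live on the same mounted filesystem."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def backup_preflight(data_dir: str, backup_dir: str, estimated_bytes: int,
                     headroom: float = 0.2) -> List[str]:
    """Return reasons not to start the backup; an empty list means go ahead.

    `estimated_bytes` is the caller's estimate of the backup size and
    `headroom` is an assumed safety margin on top of that estimate.
    """
    problems = []
    if same_device(data_dir, backup_dir):
        problems.append(
            "backup target is on the same device as the data directory")
    free = shutil.disk_usage(backup_dir).free
    required = int(estimated_bytes * (1 + headroom))
    if free < required:
        problems.append(
            f"insufficient free space: {free} bytes free, "
            f"{required} bytes required")
    return problems
```

A backup script could call `backup_preflight` with the database's data directory and the intended output directory, and refuse to continue if the returned list is not empty.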