2021-07-30 Old RabbitMQ prod broker deleted before the EsrDataExport service was moved to the new RabbitMQ broker

Date

Jul 30, 2021

Authors

@Marcello Fabbri @Liban Hirey

Status

Done

Summary

When a new RabbitMQ broker was created in AWS, we did not migrate the ESR DataExport service to it (all other services were migrated successfully). Deleting the old broker then triggered errors on the DataExport service.

Impact

No users reported any problems during the incident.

Non-technical Description

The ESR DataExport service was not migrated in time, so some application and notification confirmations could have gone missing after the old broker was deleted.


Trigger

Deletion of the old RabbitMQ broker in AWS.


Detection

Slack notification from Sentry @ 10:11 on 30/07/2021.

Resolution

  • Updated the configuration for the DataExportService.

  • Because the DataExportService had not been on the new Rabbit broker for more than a day, we presumed that any evidence of missing data would be found by investigating the esr.queue.apprecord.created.dataexporter queue, where messages were likely to have stalled, unconsumed by any service (until the deletion of the old broker). The Audit database would still hold these messages, so we could see whether anything went through that queue during that period and whether that flow of data was successful.

 

  • We looked at a message that had gone through that queue during the period when the DataExportService was not on the new Rabbit broker (a placement update, at 2:31 AM on 30/07/2021) and retrieved its trail through its correlationId: metabase link.

     

  • We checked whether that record had been exported by the DataExporter, and it appears that it had. This suggests data was exported despite the issue (probably held somewhere in the ESR interface and then flowing through once the DataExportService was put on the new Rabbit broker).
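The investigation steps above can be sketched as a filter over audit records. This is only an illustration, not the query we actually ran: it assumes audit records are available as dictionaries with hypothetical fields `queue`, `timestamp` and `correlationId` (the real Audit database schema may differ), and it takes the affected window from the timeline (prod migration of the other services to the DataExport switch-over).

```python
from collections import defaultdict
from datetime import datetime

QUEUE = "esr.queue.apprecord.created.dataexporter"
# Assumed window, taken from the timeline: other services moved to the new
# broker (PROD) until the DataExport service was switched over.
WINDOW_START = datetime(2021, 7, 27, 11, 25)
WINDOW_END = datetime(2021, 7, 30, 10, 56)


def trails_in_window(audit_records, queue=QUEUE, start=WINDOW_START, end=WINDOW_END):
    """Return {correlationId: [records sorted by time]} for every message
    that passed through `queue` inside the window.

    The field names used here are hypothetical.
    """
    # correlationIds of messages seen on the queue during the window
    affected = {
        r["correlationId"]
        for r in audit_records
        if r["queue"] == queue and start <= r["timestamp"] <= end
    }
    # Collect the full trail (all audit records) for each affected message
    trails = defaultdict(list)
    for r in audit_records:
        if r["correlationId"] in affected:
            trails[r["correlationId"]].append(r)
    for trail in trails.values():
        trail.sort(key=lambda r: r["timestamp"])
    return dict(trails)
```

Each returned trail can then be inspected by hand to confirm whether the message eventually reached the exporter.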

 

This list of GeneratedAppRecords appears to have been successfully exported by the EsrDataExportService, as seen in the PendingExport table. However, some Placements have not had their ESR status updated on TIS and will need to be corrected manually.

Timeline

Jul 26, 2021: 15:00 - Other services migrated to new RabbitMQ broker (STAGE)

Jul 27, 2021: 11:25 - Other services migrated to new RabbitMQ broker (PROD)

Jul 30, 2021: 10:05 - Old RabbitMQ broker deleted from AWS by Liban

Jul 30, 2021: 10:11 - Sentry alert posted in the #sentry-esr Slack channel

Jul 30, 2021: 10:32 - Pepe alerted the dev team

Jul 30, 2021: 10:56 - Liban switches the DataExport service over to the new RabbitMQ broker after realising it had not been set up on it

Jul 30, 2021: 11:11 - Liban and Marcello start investigating to verify whether any data is missing

e.g. check DB for Applicants exported 27th-29th:

  • Placements should have a tick

  • Message in the audit database

  • Confirm Exporter sent Applicants as expected
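The three checks above could be recorded per applicant along these lines. This is a minimal sketch under stated assumptions: `ApplicantCheck` and its fields are hypothetical names introduced for illustration, and the boolean values would have to be filled in from the TIS, Audit and export data by whatever queries are appropriate.

```python
from dataclasses import dataclass


@dataclass
class ApplicantCheck:
    applicant_id: str          # hypothetical identifier
    placement_ticked: bool     # the Placement shows a tick
    audit_message_found: bool  # a message exists in the audit database
    exporter_sent: bool        # the Exporter sent the Applicant as expected


def failed_checks(check):
    """Return the names of the checks that did not pass for one applicant."""
    results = {
        "placement tick": check.placement_ticked,
        "audit message": check.audit_message_found,
        "exporter sent": check.exporter_sent,
    }
    return [name for name, ok in results.items() if not ok]


def needs_follow_up(checks):
    """Map applicant_id to its failed checks, omitting applicants that pass all three."""
    return {c.applicant_id: failed_checks(c) for c in checks if failed_checks(c)}
```

An applicant that appears in the result with only "placement tick" failing would be a candidate for manual correction rather than evidence of lost data.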

 

Root Cause(s)

Deletion of the old RabbitMQ broker in AWS.

Dispersed/scattered configuration

Lessons Learned