Using a managed RabbitMQ broker

Why

All the usual arguments for using a managed service apply.

Validation

We haven’t looked at performance, if only because the MongoDB cluster, hosted on a single VM, has been the most fragile part of the infrastructure.

We tested this in the stage environment by taking down MongoDB in the middle of processing messages:

  1. Upload a file to S3 (see the upload sketch after this list)

  2. Restart MongoDB while the file is being processed

  3. Identify a transaction that was retried

  4. Find evidence that the retried transaction was eventually processed
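
For the upload step, a minimal sketch assuming the AWS SDK v2 for Java; the bucket name is hypothetical, and the in/ key prefix matches the file key shown in the results below:

    import java.nio.file.Path;

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class StageUpload {
        public static void main(String[] args) {
            // The bucket name is an assumption; the "in/" prefix matches the key used in the results below.
            String bucket = "stage-reconciliation-inbound"; // hypothetical
            Path file = Path.of("DE_EMD_RMC_20200116_00002188_TESTT.DAT");

            try (S3Client s3 = S3Client.create()) {
                PutObjectRequest request = PutObjectRequest.builder()
                        .bucket(bucket)
                        .key("in/" + file.getFileName())
                        .build();
                // Upload the test file to the inbound prefix to trigger processing.
                s3.putObject(request, RequestBody.fromFile(file));
            }
        }
    }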


Results

  1. We uploaded in/DE_EMD_RMC_20200116_00002188_TESTT.DAT to the stage environment.

  2. We restarted MongoDB while the file was being processed.

  3. Using the logs for the reconciliation service, we found an exception for a message with correlation ID 2f75f90a-0a65-4ef2-8164-1890b2d16df9:

    2021-02-24 15:50:56.533 ERROR 1 --- [ntContainer#0-1] c.h.t.e.r.listener.RmtRecordListener : Runtime exception was thrown while processing inbound RMT Record for correlationId : [2f75f90a-0a65-4ef2-8164-1890b2d16df9]
    org.springframework.data.mongodb.UncategorizedMongoDbException: Command failed with error 112 (WriteConflict): 'WriteConflict' on server mongo3:27013. The full response is {"errorLabels": ["TransientTransactionError"], "operationTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "ok": 0.0, "errmsg": "WriteConflict", "code": 112, "codeName": "WriteConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "signature": {"hash": {"$binary": {"base64": "AFS2+t9qfcWNvqCjKlsUGd79ZVE=", "subType": "00"}}, "keyId": 6875719810232090627}}};
    nested exception is com.mongodb.MongoCommandException: Command failed with error 112 (WriteConflict): 'WriteConflict' on server mongo3:27013. The full response is {"errorLabels": ["TransientTransactionError"], "operationTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "ok": 0.0, "errmsg": "WriteConflict", "code": 112, "codeName": "WriteConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "signature": {"hash": {"$binary": {"base64": "AFS2+t9qfcWNvqCjKlsUGd79ZVE=", "subType": "00"}}, "keyId": 6875719810232090627}}}
  4. We searched the audit log via Metabase for the correlation ID:

    1. This query of the audit log shows the failed message (Metabase ID: 603675e55fca3b29ce78495c) and subsequent messages.

    2. The delay added by the retry is visible in the message properties (a sketch of the assumed delay-queue topology follows this list):

      ... { "reason": "expired", "count": 1, "exchange": "ex.error", "time": "2021-02-24T15:51:01Z", "routing-keys": [ "esr.porpos.split" ], "queue": "q.error.delay" }, ... "timestamp": "2021-02-24T15:50:56Z" ...


    3. Subsequent messages following the “position saved” message can be seen in the audit log.
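
The "reason": "expired" entry above is consistent with a TTL-based retry: the failing message appears to be published to ex.error, parked on q.error.delay, and dead-lettered back to a work exchange once the queue TTL expires. Below is a minimal Spring AMQP sketch of that assumed topology; ex.error, q.error.delay and the esr.porpos.split routing key come from the message properties above, while the work exchange/queue names and the TTL value are invented for illustration.

    import org.springframework.amqp.core.Binding;
    import org.springframework.amqp.core.BindingBuilder;
    import org.springframework.amqp.core.Declarables;
    import org.springframework.amqp.core.DirectExchange;
    import org.springframework.amqp.core.ExchangeBuilder;
    import org.springframework.amqp.core.Queue;
    import org.springframework.amqp.core.QueueBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class RetryTopologyConfig {

        // Names taken from the audit-log evidence above; the TTL and the
        // work exchange/queue names are assumptions for illustration.
        private static final String ERROR_EXCHANGE = "ex.error";
        private static final String DELAY_QUEUE = "q.error.delay";
        private static final String WORK_EXCHANGE = "ex.reconciliation"; // hypothetical
        private static final String WORK_QUEUE = "q.rmt.record";         // hypothetical
        private static final String ROUTING_KEY = "esr.porpos.split";
        private static final int RETRY_DELAY_MS = 5_000;                 // assumed

        @Bean
        public Declarables retryTopology() {
            DirectExchange errorExchange = ExchangeBuilder.directExchange(ERROR_EXCHANGE).durable(true).build();
            DirectExchange workExchange = ExchangeBuilder.directExchange(WORK_EXCHANGE).durable(true).build();

            // Failed messages are published to ex.error and sit in q.error.delay until the
            // per-queue TTL expires; the broker then dead-letters them ("reason": "expired")
            // to the work exchange with their original routing key, re-delivering them to
            // the work queue for another attempt.
            Queue delayQueue = QueueBuilder.durable(DELAY_QUEUE)
                    .withArgument("x-message-ttl", RETRY_DELAY_MS)
                    .withArgument("x-dead-letter-exchange", WORK_EXCHANGE)
                    .build();

            Queue workQueue = QueueBuilder.durable(WORK_QUEUE).build();

            Binding delayBinding = BindingBuilder.bind(delayQueue).to(errorExchange).with(ROUTING_KEY);
            Binding workBinding = BindingBuilder.bind(workQueue).to(workExchange).with(ROUTING_KEY);

            return new Declarables(errorExchange, workExchange, delayQueue, workQueue, delayBinding, workBinding);
        }
    }

Under that assumption, the gap between the original failure (15:50:56Z) and the re-delivery (15:51:01Z) would simply be the delay queue's TTL, which matches the roughly five-second difference visible in the timestamps above.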