Using a managed RabbitMQ broker

Why

All the usual arguments for using a managed service apply.

Validation

We haven’t looked at performance, if only because the MongoDB cluster, hosted on a single VM, has been the most fragile part of the infrastructure.

We tested this in the stage environment by taking down MongoDB in the middle of processing messages:

  1. Upload a file to S3 (see the upload sketch after this list)

  2. Restart MongoDB while the file is being processed

  3. Identify a transaction that was retried

  4. Find evidence that the retried transaction was eventually processed
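
For the upload step, a minimal sketch assuming the AWS SDK v2 for Java; the bucket name is hypothetical, and the in/ key prefix matches the file key shown in the results below:

    import java.nio.file.Path;

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    public class StageUpload {
        public static void main(String[] args) {
            // The bucket name is an assumption; the "in/" prefix matches the key used in the results below.
            String bucket = "stage-reconciliation-inbound"; // hypothetical
            Path file = Path.of("DE_EMD_RMC_20200116_00002188_TESTT.DAT");

            try (S3Client s3 = S3Client.create()) {
                PutObjectRequest request = PutObjectRequest.builder()
                        .bucket(bucket)
                        .key("in/" + file.getFileName())
                        .build();
                // Upload the test file to the inbound prefix to trigger processing.
                s3.putObject(request, RequestBody.fromFile(file));
            }
        }
    }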


Results

  1. We uploaded in/DE_EMD_RMC_20200116_00002188_TESTT.DAT to the stage environment.

  2. We restarted MongoDB while the file was being processed.

  3. Using the logs for the reconciliation service, we found an exception for a message with correlation ID 2f75f90a-0a65-4ef2-8164-1890b2d16df9:

    2021-02-24 15:50:56.533 ERROR 1 --- [ntContainer#0-1] c.h.t.e.r.listener.RmtRecordListener : Runtime exception was thrown while processing inbound RMT Record for correlationId : [2f75f90a-0a65-4ef2-8164-1890b2d16df9]
    org.springframework.data.mongodb.UncategorizedMongoDbException: Command failed with error 112 (WriteConflict): 'WriteConflict' on server mongo3:27013. The full response is {"errorLabels": ["TransientTransactionError"], "operationTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "ok": 0.0, "errmsg": "WriteConflict", "code": 112, "codeName": "WriteConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "signature": {"hash": {"$binary": {"base64": "AFS2+t9qfcWNvqCjKlsUGd79ZVE=", "subType": "00"}}, "keyId": 6875719810232090627}}};
    nested exception is com.mongodb.MongoCommandException: Command failed with error 112 (WriteConflict): 'WriteConflict' on server mongo3:27013. The full response is {"errorLabels": ["TransientTransactionError"], "operationTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "ok": 0.0, "errmsg": "WriteConflict", "code": 112, "codeName": "WriteConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "signature": {"hash": {"$binary": {"base64": "AFS2+t9qfcWNvqCjKlsUGd79ZVE=", "subType": "00"}}, "keyId": 6875719810232090627}}}
  4. We searched the audit log via Metabase for the correlation ID:

    1. This query of the audit log shows the failed message (Metabase ID: 603675e55fca3b29ce78495c) and subsequent messages.

    2. The delay added by the retry is visible in the message properties (a sketch of the assumed delay-queue topology follows this list):

      ... { "reason": "expired", "count": 1, "exchange": "ex.error", "time": "2021-02-24T15:51:01Z", "routing-keys": [ "esr.porpos.split" ], "queue": "q.error.delay" }, ... "timestamp": "2021-02-24T15:50:56Z" ...


    3. Subsequent messages following the “position saved” message can be seen in the audit log.
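
The "reason": "expired" entry above is consistent with a TTL-based retry: the failing message appears to be published to ex.error, parked on q.error.delay, and dead-lettered back to a work exchange once the queue TTL expires. Below is a minimal Spring AMQP sketch of that assumed topology; ex.error, q.error.delay and the esr.porpos.split routing key come from the message properties above, while the work exchange/queue names and the TTL value are invented for illustration.

    import org.springframework.amqp.core.Binding;
    import org.springframework.amqp.core.BindingBuilder;
    import org.springframework.amqp.core.Declarables;
    import org.springframework.amqp.core.DirectExchange;
    import org.springframework.amqp.core.ExchangeBuilder;
    import org.springframework.amqp.core.Queue;
    import org.springframework.amqp.core.QueueBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class RetryTopologyConfig {

        // Names taken from the audit-log evidence above; the TTL and the
        // work exchange/queue names are assumptions for illustration.
        private static final String ERROR_EXCHANGE = "ex.error";
        private static final String DELAY_QUEUE = "q.error.delay";
        private static final String WORK_EXCHANGE = "ex.reconciliation"; // hypothetical
        private static final String WORK_QUEUE = "q.rmt.record";         // hypothetical
        private static final String ROUTING_KEY = "esr.porpos.split";
        private static final int RETRY_DELAY_MS = 5_000;                 // assumed

        @Bean
        public Declarables retryTopology() {
            DirectExchange errorExchange = ExchangeBuilder.directExchange(ERROR_EXCHANGE).durable(true).build();
            DirectExchange workExchange = ExchangeBuilder.directExchange(WORK_EXCHANGE).durable(true).build();

            // Failed messages are published to ex.error and sit in q.error.delay until the
            // per-queue TTL expires; the broker then dead-letters them ("reason": "expired")
            // to the work exchange with their original routing key, re-delivering them to
            // the work queue for another attempt.
            Queue delayQueue = QueueBuilder.durable(DELAY_QUEUE)
                    .withArgument("x-message-ttl", RETRY_DELAY_MS)
                    .withArgument("x-dead-letter-exchange", WORK_EXCHANGE)
                    .build();

            Queue workQueue = QueueBuilder.durable(WORK_QUEUE).build();

            Binding delayBinding = BindingBuilder.bind(delayQueue).to(errorExchange).with(ROUTING_KEY);
            Binding workBinding = BindingBuilder.bind(workQueue).to(workExchange).with(ROUTING_KEY);

            return new Declarables(errorExchange, workExchange, delayQueue, workQueue, delayBinding, workBinding);
        }
    }

Under that assumption, the gap between the original failure (15:50:56Z) and the re-delivery (15:51:01Z) would simply be the delay queue's TTL, which matches the roughly five-second difference visible in the timestamps above.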