Using a managed RabbitMQ broker
Why
All the usual reasons for using a managed service apply: reduced operational overhead, built-in resilience, and vendor-managed upgrades and patching.
Validation
We haven’t looked at performance, if only because the MongoDB cluster, hosted on a single VM, has been the most fragile part of the infrastructure.
We tested this on stage by taking down MongoDB in the middle of processing messages:
1. Upload a file to S3
2. Restart MongoDB
3. Identify a transaction that has been retried
4. Find evidence that the transaction completed after the retry
The write path these steps exercise is sketched below.
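Roughly, that path is a RabbitMQ listener that persists each inbound record to MongoDB inside a transaction, so a restart mid-message surfaces as an exception and the message takes the error/retry route. The sketch below is illustrative rather than the production code: the queue name, collection name, and payload shape are assumptions, only the listener's role matches the RmtRecordListener seen in the logs, and the comments describe an assumed retry configuration.

```java
import org.bson.Document;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

// Minimal sketch of the inbound record path (the real class in the logs is
// RmtRecordListener). Queue and collection names here are illustrative.
@Component
public class InboundRecordListener {

  private final MongoTemplate mongoTemplate;

  public InboundRecordListener(MongoTemplate mongoTemplate) {
    this.mongoTemplate = mongoTemplate;
  }

  // Assumes a MongoTransactionManager is configured, since the logged error
  // carries the TransientTransactionError label. If MongoDB is down or
  // mid-restart the insert throws, the message is not acknowledged, and
  // (depending on the app's retry configuration) it ends up on the error
  // exchange ex.error for a delayed retry.
  @Transactional
  @RabbitListener(queues = "q.esr.inbound") // illustrative queue name
  public void onRecord(String body) {
    mongoTemplate.insert(new Document("payload", body), "records"); // illustrative collection
  }
}
```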
Results
We uploaded in/DE_EMD_RMC_20200116_00002188_TESTT.DAT in stage. Using the logs for the reconciliation service, we found an exception for a message with correlation ID 2f75f90a-0a65-4ef2-8164-1890b2d16df9:

2021-02-24 15:50:56.533 ERROR 1 --- [ntContainer#0-1] c.h.t.e.r.listener.RmtRecordListener : Runtime exception was thrown while processing inbound RMT Record for correlationId : [2f75f90a-0a65-4ef2-8164-1890b2d16df9] org.springframework.data.mongodb.UncategorizedMongoDbException: Command failed with error 112 (WriteConflict): 'WriteConflict' on server mongo3:27013. The full response is {"errorLabels": ["TransientTransactionError"], "operationTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "ok": 0.0, "errmsg": "WriteConflict", "code": 112, "codeName": "WriteConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "signature": {"hash": {"$binary": {"base64": "AFS2+t9qfcWNvqCjKlsUGd79ZVE=", "subType": "00"}}, "keyId": 6875719810232090627}}}; nested exception is com.mongodb.MongoCommandException: Command failed with error 112 (WriteConflict): 'WriteConflict' on server mongo3:27013. The full response is {"errorLabels": ["TransientTransactionError"], "operationTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "ok": 0.0, "errmsg": "WriteConflict", "code": 112, "codeName": "WriteConflict", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1614181856, "i": 71}}, "signature": {"hash": {"$binary": {"base64": "AFS2+t9qfcWNvqCjKlsUGd79ZVE=", "subType": "00"}}, "keyId": 6875719810232090627}}}
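The errorLabels field in that response ("TransientTransactionError") is what marks the failure as safe to retry. As an illustration only (not taken from the service's code), the MongoDB Java driver exposes that label on the exception, so an error handler can separate transient failures from permanent ones:

```java
import com.mongodb.MongoException;

// Sketch: walk the cause chain of a wrapped exception (Spring surfaces the
// driver error as UncategorizedMongoDbException) and check whether the
// underlying MongoException carries the TransientTransactionError label.
public final class TransientErrorCheck {

  public static boolean isTransient(Throwable thrown) {
    for (Throwable cause = thrown; cause != null; cause = cause.getCause()) {
      if (cause instanceof MongoException mongoEx
          && mongoEx.hasErrorLabel(MongoException.TRANSIENT_TRANSACTION_ERROR_LABEL)) {
        return true;
      }
    }
    return false;
  }

  private TransientErrorCheck() {}
}
```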
We looked through the audit log via Metabase for the correlation ID:
This query of the audit log shows the failed message (Metabase ID: 603675e55fca3b29ce78495c) and subsequent messages.
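The same lookup can also be done directly against MongoDB rather than through Metabase. This is only a sketch: the connection string, database, collection, and field names are assumptions, and only the correlation ID value comes from the log above.

```java
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Sorts.ascending;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

// Illustrative query for the audit trail of one correlation ID.
public final class AuditLookup {

  public static void main(String[] args) {
    String correlationId = "2f75f90a-0a65-4ef2-8164-1890b2d16df9";
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // assumed URI
      client.getDatabase("audit")                    // assumed database name
          .getCollection("messages")                 // assumed collection name
          .find(eq("correlationId", correlationId))  // assumed field name
          .sort(ascending("timestamp"))              // assumed field name
          .forEach(doc -> System.out.println(doc.toJson()));
    }
  }

  private AuditLookup() {}
}
```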
The delay added by the retry is visible in the message properties:
... { "reason": "expired", "count": 1, "exchange": "ex.error", "time": "2021-02-24T15:51:01Z", "routing-keys": [ "esr.porpos.split" ], "queue": "q.error.delay" }, ... "timestamp": "2021-02-24T15:50:56Z" ...
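Those x-death properties (reason expired, queue q.error.delay, exchange ex.error) are consistent with a TTL-plus-dead-letter retry: the failed message sits on a delay queue until its TTL expires and is then dead-lettered back to the work exchange for another attempt, roughly the five seconds between the two timestamps above. A sketch of that topology in Spring AMQP follows; ex.error, q.error.delay and esr.porpos.split come from the message properties, while the work exchange name and the exact TTL are assumptions.

```java
import org.springframework.amqp.core.Binding;
import org.springframework.amqp.core.BindingBuilder;
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.amqp.core.TopicExchange;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Sketch of a TTL + dead-letter retry topology consistent with the x-death
// entry above. The work exchange name and the 5s TTL are assumptions.
@Configuration
public class RetryTopologyConfig {

  @Bean
  TopicExchange errorExchange() {
    return new TopicExchange("ex.error");
  }

  @Bean
  Queue errorDelayQueue() {
    return QueueBuilder.durable("q.error.delay")
        // Hold failed messages for ~5 seconds (the 15:50:56 -> 15:51:01 gap
        // in the audit entry; the real value may differ)...
        .withArgument("x-message-ttl", 5000)
        // ...then dead-letter them to the work exchange. The original routing
        // key (e.g. esr.porpos.split) is kept, so the message is redelivered
        // to the queue it failed on.
        .withArgument("x-dead-letter-exchange", "ex.work") // assumed name
        .build();
  }

  @Bean
  Binding errorDelayBinding() {
    // Anything published to ex.error lands on the delay queue.
    return BindingBuilder.bind(errorDelayQueue()).to(errorExchange()).with("#");
  }
}
```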
Subsequent messages, from the “position saved” message onwards, can be seen in the audit log, confirming that processing continued after the retry.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213