Troubleshoot MongoDB replica set problems
Initial Problem
Redeploy of EsrAuditService - the healthcheck on DEV failed.
https://build.tis.nhs.uk/jenkins/job/HEE/job/TIS-EsrAuditService/job/master/
Investigation
To investigate further, I ssh’d into the blue DEV Azure VM.
ssh 10.150.0.137
I then looked at the tail of the log file for the ESR audit microservice.
docker logs esrauditservice_esrinbounddatareaderservice_1 --tail 100 -f
The log entries showed the following:
2020-04-01 08:47:33.386 INFO 1 --- [150.0.151:27013] org.mongodb.driver.cluster : Server 10.150.0.151:27013 does not appear to be a member of an initiated replica set.
2020-04-01 08:47:33.387 INFO 1 --- [150.0.151:27012] org.mongodb.driver.cluster : Server 10.150.0.151:27012 does not appear to be a member of an initiated replica set.
2020-04-01 08:47:33.388 INFO 1 --- [150.0.151:27011] org.mongodb.driver.cluster : Server 10.150.0.151:27011 does not appear to be a member of an initiated replica set.
2020-04-01 08:47:33.888 INFO 1 --- [150.0.151:27013] org.mongodb.driver.cluster : Server 10.150.0.151:27013 does not appear to be a member of an initiated replica set.
The key part of the error,
'Server 10.150.0.151:27012 does not appear to be a member of an initiated replica set.'
points at a problem with the MongoDB replica set.
Remember - the replica set is the 3 mongo nodes acting as a cluster. When the replica set is correctly set up, it should have 1 primary and 2 secondaries (see the sketch just below).
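For reference, here is a minimal mongo-shell sketch (assumed commands, not output captured during this incident) of how the member states should look on a healthy rs0, using the node names and ports from this setup (mongo1-3 on 27011-27013):
// Run from the mongo shell on any member of a healthy replica set:
rs.status().members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr);
});
// Expected for a healthy rs0:
//   mongo1:27011 -> PRIMARY
//   mongo2:27012 -> SECONDARY
//   mongo3:27013 -> SECONDARY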
So to check the state of the replica set for DEV:
Mongo replica set VM name | address |
---|---|
HEE-TIS-VM-DEV-MONGO-DB-REPLICASET | 10.150.0.151 |
HEE-TIS-VM-STAGE-MONGO-DB-REPLICASET | 10.160.0.151 |
HEE-TIS-VM-PROD-MONGO-DB-REPLICASET | 10.170.0.151 |
ssh 10.150.0.151
I then entered the ‘shell’ on the first of the mongo nodes, mongo1.
docker exec -it mongo1 /bin/bash
This should bring up a prompt like
root@mongo1:/#
You are now running bash inside the mongo1 container. From here we can get into the mongo console and check on the replicaset.
mongo "mongodb://muser:<PASSWORD>@mongo1:27011/esrreconciliation?authSource=admin&replicaSet=rs0"
Note: you will need to replace <PASSWORD> with the actual mongo admin password!
Instead of getting a normal mongo prompt, we got errors about the replicaset.
2020-04-01T08:59:33.325+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:33.325+0000 I NETWORK [js] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 10 checks in a row.
2020-04-01T08:59:33.826+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:33.826+0000 I NETWORK [js] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 11 checks in a row.
2020-04-01T08:59:34.326+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:34.827+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:35.327+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:35.827+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:36.328+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:36.828+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:37.329+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:37.829+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:38.330+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:38.830+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:38.830+0000 I NETWORK [js] Cannot reach any nodes for set rs0. Please check network connectivity and the status of the set. This has happened for 21 checks in a row.
2020-04-01T08:59:39.330+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:39.831+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:40.331+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:40.832+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:41.332+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:41.832+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:42.333+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:42.833+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:43.334+0000 W NETWORK [js] Unable to reach primary for set rs0
2020-04-01T08:59:43.334+0000 E QUERY [js] Error: connect failed to replica set rs0/mongo1:27011 :
Again, this looks like there’s a problem with the replica set named ‘rs0’. Next step: try to connect to the mongodb using the console but NOT connecting as a replica set, so:
mongo "mongodb://muser:<PASSWORD>@mongo1:27011/esrreconciliation?authSource=admin"
I was able to log in without seeing errors, which enabled me to check on the replica set using db.isMaster().
> db.isMaster()
{
	"ismaster" : false,
	"secondary" : false,
	"info" : "Does not have a valid replica set config",   <----- ERROR
	"isreplicaset" : true,
	"maxBsonObjectSize" : 16777216,
	"maxMessageSizeBytes" : 48000000,
	"maxWriteBatchSize" : 100000,
	"localTime" : ISODate("2020-04-01T09:03:26.058Z"),
	"logicalSessionTimeoutMinutes" : 30,
	"minWireVersion" : 0,
	"maxWireVersion" : 7,
	"readOnly" : false,
	"ok" : 1
}
Reading the JSON result of db.isMaster(): ismaster:false and secondary:false show the replica set is not healthy (if you’re connected to a node in a healthy replica set, you’d expect one of these to be true, as you’d be connected to either the primary or a secondary).
I think isreplicaset:true just says that this node is willing to be part of a replica set, not that it actually is in one.
The line "info" : "Does not have a valid replica set config" points to a problem with the replica set configuration.
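Before re-initialising, it can be worth confirming what configuration (if any) the node currently holds. A couple of mongo-shell checks that would normally help here (a hedged sketch; these were not captured as part of this incident):
// From the mongo shell on mongo1, connected WITHOUT the replicaSet option:
rs.conf()    // prints the current replica set configuration document, if one exists
rs.status()  // on a healthy set, lists each member and its state (PRIMARY/SECONDARY)
// On a node with no valid config, both typically fail with a "NotYetInitialized"
// style error, matching the "Does not have a valid replica set config" message above.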
Fix
The way to set the replica set configuration is via the rs.initiate() command, so it seemed sensible to reset the configuration and re-initialise the replica set. From the mongo shell on mongo1, the command is:
rs.initiate({
_id: 'rs0',
members: [
{_id:0, host:"mongo1:27011"},
{_id:1, host:"mongo2:27012"},
{_id:2, host:"mongo3:27013"}
]
});
After running this command, the prompt changed to ‘rs0:PRIMARY>’, which is usually a good sign that the replica set is happy.
I was then able to run db.isMaster() again.
(I truncated the verbose output; only the top of it is needed here.)
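For reference only, the top of db.isMaster() on a healthy primary of this set would look roughly like this (an illustrative sketch, not the output actually captured):
> db.isMaster()
{
	"hosts" : [
		"mongo1:27011",
		"mongo2:27012",
		"mongo3:27013"
	],
	"setName" : "rs0",
	"ismaster" : true,
	"secondary" : false,
	"primary" : "mongo1:27011",
	"me" : "mongo1:27011",
	...
}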
We can now see that there are 3 nodes in the cluster and that mongo1 is the primary. All looks good.
This fixed the problem with the replica set on DEV, although it was not clear why the problem occurred.
After the fix, I was able to redeploy the EsrAuditService on DEV, Stage and Prod successfully.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213