2022-01-06 ESR MongoDB cluster down

Date

Jan 6, 2022

Authors

@Reuben Roberts @John Simmons (Deactivated)

Status

Done

Summary

MongoDB cluster went down: https://hee-tis.atlassian.net/browse/TIS21-2535

Impact

No end user impact. The database that holds the information for communicating with ESR was unavailable for approximately 30 minutes, and the integration was paused during that time.

Non-technical Description

  • The MongoDB database that supports the dialogue with ESR failed. When the services resumed, the pending events from TIS, e.g. updates to personal details, were processed.


Trigger

  • Currently unknown: this particular server has suffered similar issues in the past. The database service itself did not obviously become overloaded, as no out-of-memory errors were logged, but the server became unresponsive and had to be rebooted. A sketch of how kernel OOM-killer activity can be checked in syslog follows below.
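As a reference for that check, the following is a minimal sketch in Python that scans syslog for kernel OOM-killer messages; the log path and the match patterns are assumptions (Ubuntu-style /var/log/syslog), not a record of the commands actually run during the incident.

```python
#!/usr/bin/env python3
"""Minimal sketch: look for kernel OOM-killer activity in syslog."""
import re
from pathlib import Path

SYSLOG = Path("/var/log/syslog")  # assumption: Ubuntu-style syslog location
OOM_PATTERN = re.compile(r"oom-killer|Out of memory|Killed process", re.IGNORECASE)

def find_oom_events(log_path: Path) -> list[str]:
    """Return syslog lines that suggest the kernel OOM killer fired."""
    hits = []
    with log_path.open(errors="replace") as log:
        for line in log:
            if OOM_PATTERN.search(line):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    events = find_oom_events(SYSLOG)
    if events:
        print(f"{len(events)} possible OOM events found:")
        for event in events:
            print(event)
    else:
        print("No OOM-killer activity found in syslog.")
```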

Detection

  • Slack Alert at 13:13 on Jan 6, 2022


Resolution

  • The server was restarted and resumed normal operation.


Timeline

  • Jan 6, 2022 13:11:02 - ESR Data Export Service reports 2022-01-06 13:11:02.773 INFO 1 --- ['}-mongo2:27012] org.mongodb.driver.cluster : Exception in monitor thread while connecting to server mongo2:27012

    com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message. This occurred while the service was generating APP records; similar errors for the other mongo servers in the cluster followed immediately thereafter.

  • Jan 6, 2022 13:13 - Alert on Slack: AWS Service 10.170.0.151:18080 is down.

  • Jan 6, 2022 13:17:21 - Docker reports mongo2 container is unhealthy (syslog: Jan 6 13:17:21 ip-10-170-0-151 dockerd[497]: time="2022-01-06T13:13:04.991689589Z" level=warning msg="Health check for container 971e3085ffb867b27e4909c42281e79bacff535976c05463ff5674b43d97b683 error: context deadline exceeded")

  • Jan 6, 2022 13:22 and 13:28 respectively - Docker reports mongo1 and mongo3 containers are unhealthy, as per above.

  • Jan 6, 2022 ~13:37 - Server rebooted

  • Jan 6, 2022 13:38:01 - The Mongo instances on the server log ‘MongoDB starting’

  • Jan 6, 2022 13:38 - Alert on Slack that connection is restored


Root Cause(s)

  • Logs for the three mongo containers (mongo1, mongo2 and mongo3) were reviewed, but no particular error was noted.

  • Server syslogs, AWS metrics and Prometheus monitoring statistics for the server were reviewed for abnormal load or other errors. None were noted. CPU usage at the point of the incident was high (~75%) but not extreme.

  • Logs for ESR Data Export Service on Prod Blue reflect the error in connecting to the mongo database while generating APP records.

  • Since the server itself became unresponsive, the issue was not limited to Docker container health, and may suggest resource deadlocking. The containers do not currently have resource limits configured, so their aggregate memory usage might have increased to the point where the server could not allocate memory to other key services (see the sketch after this list).
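As an illustration of the per-container limits mentioned above, here is a minimal sketch using the Docker SDK for Python. The container names are taken from the timeline, but the memory values are assumptions, and in practice the limits would more likely be set in the compose file or run command that creates the containers.

```python
#!/usr/bin/env python3
"""Minimal sketch: apply per-container memory limits to the mongo containers."""
import docker

# Assumed values: cap each container well below the host's total memory so the
# OS and other key services always have headroom.
MEM_LIMIT = "2g"       # hard memory cap per container (assumption)
MEMSWAP_LIMIT = "2g"   # equal to MEM_LIMIT, i.e. no swap on top of the cap

def apply_limits(names):
    """Update memory limits on already-running containers via the Docker API."""
    client = docker.from_env()
    for name in names:
        container = client.containers.get(name)
        container.update(mem_limit=MEM_LIMIT, memswap_limit=MEMSWAP_LIMIT)
        print(f"{name}: mem_limit={MEM_LIMIT}, memswap_limit={MEMSWAP_LIMIT}")

if __name__ == "__main__":
    apply_limits(["mongo1", "mongo2", "mongo3"])
```

Whether limits are better applied per container or to the cluster as a whole is exactly the open question captured in the action item below.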


Action Items

  • Done: Use the same (T3) EC2 instances for Production as are currently used for Staging MongoDB - @John Simmons (Deactivated)

  • Explore resource limits for the cluster as a whole, or per-container (see the sketch under Root Cause(s) above) - @John Simmons (Deactivated)

  • Implement detailed monitoring on the server (see the sketch below) - @John Simmons (Deactivated)
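For the detailed-monitoring action item, one possible first step is enabling detailed (1-minute) CloudWatch monitoring on the EC2 instance with boto3, as sketched below; the instance ID is a placeholder, and OS-level metrics such as per-container memory usage would still need an agent (e.g. the existing Prometheus exporters or the CloudWatch agent).

```python
#!/usr/bin/env python3
"""Minimal sketch: enable detailed (1-minute) CloudWatch monitoring on the
MongoDB EC2 instance. The instance ID below is a placeholder, not the real
identifier of the affected server."""
import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder (assumption)

def enable_detailed_monitoring(instance_id):
    ec2 = boto3.client("ec2")
    response = ec2.monitor_instances(InstanceIds=[instance_id])
    for item in response["InstanceMonitorings"]:
        print(f"{item['InstanceId']}: monitoring is {item['Monitoring']['State']}")

if __name__ == "__main__":
    enable_detailed_monitoring(INSTANCE_ID)
```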



Lessons Learned

  • Consider prioritising a shift to a managed mongo service (e.g. Atlas)

  • Resource management on servers hosting multiple containers (with different needs) is worth investigating, though non-trivial.