Date

Authors

Reuben Roberts

Status

Documenting

Summary

MongoDB cluster went down: https://hee-tis.atlassian.net/browse/TIS21-2535

Impact

No end-user impact. The database that holds the information for communicating with ESR was unavailable for approximately 30 minutes, and the integration was paused during that time.

Non-technical Description

  • The MongoDB database that supports the dialogue with ESR failed. When the services resumed, the pending events from TIS, e.g. updates to personal details, were processed.

...

Trigger

  • Currently unknown: presumably, the database service became overloaded, though no out-of-memory errors were logged.

Detection

  • Slack Alert at 13:13 on

...

Resolution

  • The server was restarted, after which the MongoDB instances started up and resumed normal operation.

...

Timeline

  • 13:13 - Alert on Slack: AWS Service 10.170.0.151:18080 is down.

  • 13:17:21 - Docker reports mongo2 container is unhealthy (syslog: Jan 6 13:17:21 ip-10-170-0-151 dockerd[497]: time="2022-01-06T13:13:04.991689589Z" level=warning msg="Health check for container 971e3085ffb867b27e4909c42281e79bacff535976c05463ff5674b43d97b683 error: context deadline exceeded")

  • 13:22 and 13:28 - Docker reports the mongo1 and mongo3 containers, respectively, as unhealthy, as above.

  • ~13:37 - Server rebooted

  • 13:38:01 - The Mongo instances on the server log ‘MongoDB starting’

  • 13:38 - Alert on Slack that connection is restored
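
The Docker health check warning quoted in the timeline carries two timestamps: the syslog prefix (13:17:21) and the `time=` field written by dockerd (13:13:04). When reconstructing a timeline like the one above, it can help to pull the embedded timestamp, container ID and error out of each line. The following is a minimal sketch, assuming the log lines follow the exact shape of the one quoted above; the function name `parse_health_warning` and the constant `SAMPLE` are illustrative only and are not part of any TIS tooling.

```python
import re
from datetime import datetime, timezone

# The syslog line quoted in the timeline, used here as sample input.
SAMPLE = (
    'Jan 6 13:17:21 ip-10-170-0-151 dockerd[497]: '
    'time="2022-01-06T13:13:04.991689589Z" level=warning '
    'msg="Health check for container '
    '971e3085ffb867b27e4909c42281e79bacff535976c05463ff5674b43d97b683 '
    'error: context deadline exceeded"'
)

# Assumed shape of a dockerd health check warning (based only on SAMPLE).
HEALTH_WARNING = re.compile(
    r'time="(?P<time>[^"]+)"\s+level=warning\s+'
    r'msg="Health check for container (?P<container>[0-9a-f]+) '
    r'error: (?P<error>[^"]+)"'
)


def parse_health_warning(line):
    """Return a dict with time, container and error, or None if no match."""
    match = HEALTH_WARNING.search(line)
    if match is None:
        return None

    # dockerd writes nanoseconds; datetime only holds microseconds,
    # so keep the first six fractional digits.
    stamp, _, fraction = match.group("time").rstrip("Z").partition(".")
    micros = (fraction + "000000")[:6]
    when = datetime.strptime(
        f"{stamp}.{micros}", "%Y-%m-%dT%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)

    return {
        "time": when,
        "container": match.group("container"),
        "error": match.group("error"),
    }


if __name__ == "__main__":
    event = parse_health_warning(SAMPLE)
    print(event["time"].isoformat(), event["container"][:12], event["error"])
```

Sorting the parsed events by their embedded `time` value, rather than by the syslog prefix, would order the container failures by when dockerd recorded them.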

...

Root Cause(s)

  • .

...

Action Items

  • Done: Use the same (T3) EC2 instances for Production as are currently used for Staging MongoDB. Owner: John Simmons (Deactivated)

...

Lessons Learned