2018-10-03/04 TCS D/B connectivity problems

Date: 2018-10-03 to 2018-10-04
Authors: Andy Nash / Paul Hoang
Status: Ongoing investigation
Summary: TCS calls to the D/B found no connections available
Impact: Prod goes down (when TCS is down, everything is down)

Jira reference

TISNEW-2150

Impact

  • TIS app unusable by anyone.

Root Causes

  • .

Trigger

  • The team was informed of the issue by the PO in Slack.

Resolution

  • Restarted the app to restore Prod as an initial fix.
  • Chris, Paul and others are now investigating the underlying cause.

Detection / Timeline

  • 09.52 2018-10-03 Wed: Chris spotted a TCS connection issue on Prod that caused it to go down.
  • 09.55 2018-10-03 Wed: Chris restarted the app to get it back up and running.
  • 15.17 2018-10-03 Wed: The #monitoring Slack channel raised an alert that TCS had failed a healthcheck on Prod (G), then raised the same alert for Prod (B) 10 minutes later.
  • 15.19 2018-10-03 Wed: Prod went down.
  • 15.22 2018-10-03 Wed: Chris brought Prod back up.
  • 09.15 2018-10-04 Thu: Alistair reported that users were having problems again with Prod going down (no healthcheck alert in #monitoring).
  • 09.30 2018-10-04 Thu: Paul restarted the app to restore Prod.
  • 09.31 2018-10-04 Thu: Paul began investigating the root cause, noting that the person list view page is being hit heavily (but not obviously by anything malicious). Chris suggested further investigation into altering sizes and timeouts, and also into the queries previously mentioned, since 8 to 9 seconds for a simple search is lengthy (and might prompt an impatient user to refresh their screen repeatedly).
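
The suspected interaction between slow queries and "no connections available" can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the TCS codebase: the pool size of 10 and the request rate of 2 per second are hypothetical placeholders, and only the 8 to 9 second query time comes from the timeline above. The estimate uses Little's law (average connections in use = request rate x time each request holds a connection).

```python
import math


def connections_needed(requests_per_second: float, query_seconds: float) -> int:
    """Estimate the average number of D/B connections in use at once.

    Little's law: in-flight requests = arrival rate x time each one
    holds a connection.
    """
    return math.ceil(requests_per_second * query_seconds)


def pool_exhausted(pool_size: int, requests_per_second: float, query_seconds: float) -> bool:
    """True if steady-state demand exceeds the configured pool size."""
    return connections_needed(requests_per_second, query_seconds) > pool_size


# Hypothetical figures: a pool of 10 connections and 2 searches per second.
# Only the ~8.5 s query time reflects the incident notes.
POOL_SIZE = 10
slow = connections_needed(2, 8.5)   # demand with 8 to 9 second queries
fast = connections_needed(2, 0.5)   # demand if the same query took 0.5 s

print(f"slow queries need ~{slow} connections, fast queries ~{fast}")
print("pool exhausted with slow queries:", pool_exhausted(POOL_SIZE, 2, 8.5))
```

Under these assumed numbers, shortening the query reduces demand far more than a modest increase in pool size would, which is why both lines of investigation (sizes/timeouts and query duration) are worth pursuing. Users refreshing the page raise the request rate and push demand higher still.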

Action Items

  • .

Lessons Learned

  • .

What went well

  • .

What went wrong

  • .

Where we got lucky

  • .

Supporting information

  • During the investigation, we discovered that the first connection problem occurred on Tue 2 Oct, but the app recovered by itself.