Date
  
Authors: Andy Nash / Paul Hoang
Status: Ongoing investigation
Summary: TCS called the database and found no connections available
Impact: Prod goes down (when TCS is down, everything is down)

Jira reference

TISNEW-2150

Impact

  • TIS app unusable by anyone.

Root Causes

  • .

Trigger

  • The team was informed of the issue by the PO in Slack.

Resolution

  • Paul performed a restart, which resolved the immediate problem.
  • Paul is now investigating the underlying cause.
  • .

Detection / Timeline

  • 15.17 2018-10-03 Wed: The #monitoring Slack channel raised an alert on Prod (G) that TCS had failed a healthcheck (and 10 minutes later raised the same alert on Prod (B)).
  • 15.19 2018-10-03 Wed: Chris discovered Prod was down when investigating reports from Panos that Dev was down (unrelated expected temporary issue).
  • 15.22 2018-10-03 Wed: Chris brought Prod back up.
  • 09.15 2018-10-04 Thu: Alistair reported that users were having problems again with Prod going down (no healthcheck alert appeared in #monitoring).
  • 09.30 2018-10-04 Thu: Paul restarted to restore Prod.
  • 09.31 2018-10-04 Thu: Paul began investigating the root cause, noting that the person list view page is being hit heavily (though not obviously by anything malicious). Chris suggested further investigation into altering sizes and timeouts. The previously mentioned queries should also be looked into, as 8/9 seconds for a simple search is lengthy (which might prompt an impatient user to refresh their screen repeatedly).
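
To illustrate the "no connections available" symptom from the summary, here is a minimal sketch of how a fixed-size pool runs dry when every connection is held by a slow request. It is not the TCS code: the `BoundedPool` class, the pool size of 2 and the 0.05-second acquire timeout are all hypothetical, chosen only to show how size and timeout settings interact.

```python
import queue
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection frees up within the acquire timeout."""


class BoundedPool:
    """Toy stand-in for a database connection pool (hypothetical, not the TCS pool)."""

    def __init__(self, size, acquire_timeout):
        self._acquire_timeout = acquire_timeout
        self._free = queue.Queue(maxsize=size)
        for i in range(size):
            self._free.put(f"conn-{i}")

    @contextmanager
    def connection(self):
        try:
            # Wait at most acquire_timeout seconds for a free connection.
            conn = self._free.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise PoolExhausted("no connections available") from None
        try:
            yield conn
        finally:
            self._free.put(conn)  # always hand the connection back


def demo():
    pool = BoundedPool(size=2, acquire_timeout=0.05)
    with pool.connection(), pool.connection():
        # Both connections are held, as they would be by two slow queries.
        try:
            with pool.connection():
                return "acquired"
        except PoolExhausted as exc:
            return str(exc)
```

In this sketch a third caller fails as soon as the two existing connections are held for longer than the acquire timeout. That is why slow queries and pool sizing are worth investigating together: a larger pool or a longer timeout only postpones exhaustion if each request keeps its connection for several seconds.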

Action Items

  • .

Lessons Learned

  • .

What went well

  • .

What went wrong

  • .

Where we got lucky

  • .

Supporting information

  • .
