Date | |
Authors | Andy Nash / Paul Hoang |
Status | Ongoing investigation |
Summary | TCS called the database (D/B) and found no connections available |
Impact | Prod went down (when TCS is down, everything is down) |
Jira reference
- TISNEW-2150
Impact
- TIS app unusable by anyone.
Root Causes
- .
Trigger
- The team was informed of the issue by the PO in Slack.
Resolution
- A restart by Paul resolved the initial problem.
- Paul is now investigating the underlying cause.
Detection / Timeline
- 15.17 2018-10-03 Wed: The #monitoring Slack channel raised an alert on Prod (G) that TCS had failed a healthcheck (and 10 minutes later raised the same alert on Prod (B)).
- 15.19 2018-10-03 Wed: Chris discovered Prod was down while investigating reports from Panos that Dev was down (an unrelated, expected temporary issue).
- 15.22 2018-10-03 Wed: Chris brought Prod back up.
- 09.15 2018-10-04 Thu: Alistair reported that users were again having problems with Prod going down (no healthcheck alert in #monitoring).
- 09.30 2018-10-04 Thu: Paul restarted to restore Prod.
- 09.31 2018-10-04 Thu: Paul began investigating the root cause, commenting that the person list view page is being heavily hit (but not obviously by anything malicious). Chris suggested further investigation into altering sizes and timeouts, and also into the queries mentioned earlier, since 8 to 9 seconds for a simple search is lengthy (and could prompt an impatient user to refresh their screen repeatedly).
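The interaction noted in the last timeline entry (slow queries, a bounded number of connections, and acquisition timeouts) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `BoundedPool`, `simulate`, `sustainable_rps`, and every number used are hypothetical, durations are scaled down so the example runs quickly, and nothing here reflects the actual TCS configuration or a confirmed root cause.

```python
import threading
import time


class BoundedPool:
    """Toy stand-in for a database connection pool (hypothetical, not TCS code)."""

    def __init__(self, size, acquire_timeout):
        self._slots = threading.Semaphore(size)
        self._acquire_timeout = acquire_timeout

    def run_query(self, query_seconds):
        # Wait up to acquire_timeout for a free connection.
        if not self._slots.acquire(timeout=self._acquire_timeout):
            return False  # corresponds to "no connections available"
        try:
            time.sleep(query_seconds)  # simulate the query holding the connection
            return True
        finally:
            self._slots.release()


def simulate(pool_size, acquire_timeout, query_seconds, requests):
    """Fire `requests` concurrent queries; return (succeeded, failed) counts."""
    pool = BoundedPool(pool_size, acquire_timeout)
    results = []
    lock = threading.Lock()

    def worker():
        ok = pool.run_query(query_seconds)
        with lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results.count(True), results.count(False)


def sustainable_rps(pool_size, query_seconds):
    """Rough upper bound on requests per second if every request holds
    one connection for query_seconds."""
    return pool_size / query_seconds


if __name__ == "__main__":
    # Slow queries relative to the acquisition timeout: most requests fail.
    print(simulate(pool_size=2, acquire_timeout=0.05, query_seconds=0.3, requests=6))
    # Fast queries: the same pool serves every request.
    print(simulate(pool_size=2, acquire_timeout=1.0, query_seconds=0.01, requests=6))
    # Hypothetical pool of 10 with 8.5 s queries: roughly 1.2 requests per second.
    print(sustainable_rps(10, 8.5))
```

The sketch suggests why both lines of investigation matter: larger sizes or longer timeouts raise the ceiling, but shorter queries free connections sooner, and repeated page refreshes add requests while earlier ones still hold connections.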
Action Items
- .
Lessons Learned
- .
What went well
- .
What went wrong
- .
Where we got lucky
- .
Supporting information
.