...
- Informed of issue via PO in Slack.
Resolution
- Paul restarted the app, which restored Prod and solved the initial problem.
- Chris / Paul / others now investigating underlying cause.
Detection / Timeline
- 09.52 2018-10-03 Wed: Chris spotted a TCS connection issue on Prod, which resulted in Prod going down.
- 09.55 2018-10-03 Wed: Chris restarted the app to get it back up and running.
- 15.17 2018-10-03 Wed: #monitoring Slack channel threw an alert on Prod (G) that TCS failed a healthcheck (and then, 10 mins later, threw the same alert on Prod (B)).
- 15.19 2018-10-03 Wed: Chris discovered Prod was down when investigating reports from Panos that Dev was down (unrelated, expected temporary issue).
- 15.22 2018-10-03 Wed: Chris brought Prod back up.
- 09.15 2018-10-04 Thu: Alistair reported users having problems again with Prod going down (no healthcheck alert in #monitoring).
- 09.30 2018-10-04 Thu: Paul restarted the app to restore Prod.
- 09.31 2018-10-04 Thu: Paul began investigating root cause, commenting that the person list view page is being heavily hit (but not obviously by anything malicious). Chris suggested further investigation into altering sizes and timeouts, but we should also look into the queries I mentioned, as 8/9 seconds for a simple search is lengthy (which could prompt an impatient user to refresh their screen repeatedly).
...
Where we got lucky
- .
Supporting information
During investigation, discovered that the first connection problem occurred on Tue 2 Oct, but the app recovered by itself.