Date |
|
Authors | Andy Nash, Joseph (Pepe) Kelly, John Simmons, Andy Dingley, Simon Meredith |
Status | In progress |
Summary | On Friday evening we saw Jenkins struggle and then fall over, which subsequently caused many other scheduled weekend jobs to fail |
Impact | No Stage. No Prod. No data syncing in various places |
...
2020-10-16 06:33 ESR Data exporter triggered a build of outstanding PRs (resulting from Dependabot)
2020-10-17 06:00 ESR n-d-l cron job didn’t start - manually kicked by Paul at 10:08, exited at 10:09
2020-10-17 06:00 ESR ETL
2020-10-17 08:48 (see Prometheus graph below)
2020-10-17 09:27 (see Prometheus graph below)
2020-10-17 10:23 D/B Prod/Stage sync started but never completed
2020-10-17 10:23 NDW ETL: Stage (PaaS) failed
2020-10-18 02:42 ESR Sentry errors x 7 (reappearance of the same issue across all services)
2020-10-18 07:37 TCS ES sync job failed to run/complete on either blue or green servers
2020-10-18 10:23 NDW ETL: Stage (PaaS) failed
2020-10-18 10:25 NDW ETL: Stage (current) failed
2020-10-19 01:29 TCS ES Person sync job failed (None of the configured nodes were available)
2020-10-19 07:46 Users started reporting problems using Search on Prod
2020-10-19 08:59 Problems reported by users with Search on Prod had been resolved
2020-10-19 07:54 (see Prometheus graph below)
2020-10-19 08:17 (see Prometheus graph below)
2020-10-19 10:35 (see Prometheus graph below)
2020-10-19 massive Sentry hit, on ESR, using up our entire monthly allocation
2020-10-20 07:30 Person Placement Employing Body Trust job failed to run/complete on either blue or green servers
Prime Timeline - according to the monitoring channel
2020-10-16 (Friday) 07:18 Staging RabbitMQ node 2 down
2020-10-16 (Friday) 07:38 Prod ES node 3 down
2020-10-16 (Friday) 07:38 Prod ES node 1 & 2 down - additional alert of too few nodes running - at this point, prod person search should not be working
2020-10-16 (Friday) 07:58 Staging ES node 2 down
2020-10-16 (Friday) 08:43 Phil W asks what this all means, Phil J summarises
2020-10-16 (Friday) 12:03 Old concerns on green&blue stage goes down
2020-10-16 (Friday) 16:38 Jenkins goes down
Same alerts continue over the weekend and ETL failures occur because ES is down
2020-10-17 (Saturday) 01:13 high messages in RabbitMQ Prod
2020-10-17 (Saturday) 01:28 high messages in RabbitMQ Staging
2020-10-17 (Saturday) 07:08 Staging ES node 2 down, Prod RabbitMQ node 3 down
2020-10-17 (Saturday) 07:18 Staging ES node 1 & 3 down
2020-10-17 (Saturday) 07:33 Staging RabbitMQ node 1 & 3 down
2020-10-17 (Saturday) 07:43 Prod Mongo goes down
Further alerts are not listed individually; everything is broken at this point
...
Everything fell over | |
---|---|
|
|
Discussion, along with short and longer term actions
What can we do about Dependabot creating PRs and building them simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things, so everything appears concurrent.
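As a rough illustration of why sequential PR creation can still look concurrent to Jenkins, here is a toy simulation. It is a sketch only: the function name `simulate_builds`, the arrival interval, the build duration and the concurrency cap are all hypothetical values chosen for illustration, not measurements from this incident or any Jenkins setting.

```python
import heapq


def simulate_builds(num_prs, arrival_interval, build_duration, max_concurrent=None):
    """Toy model: PRs arrive one at a time, each triggering one build.

    Returns (peak_concurrent_builds, time_last_build_finishes).
    All parameters are illustrative assumptions, in arbitrary time units.
    """
    running = []  # min-heap of finish times for in-flight builds
    peak = 0
    last_finish = 0
    for i in range(num_prs):
        start = i * arrival_interval  # PRs are created strictly sequentially
        # Retire builds that finished before this PR arrived.
        while running and running[0] <= start:
            heapq.heappop(running)
        # With a cap, this build waits for the earliest in-flight build to finish.
        if max_concurrent is not None and len(running) >= max_concurrent:
            start = heapq.heappop(running)
        finish = start + build_duration
        heapq.heappush(running, finish)
        peak = max(peak, len(running))
        last_finish = max(last_finish, finish)
    return peak, last_finish


# Ten PRs opened one time unit apart, each build taking 20 units.
uncapped = simulate_builds(10, 1, 20)
capped = simulate_builds(10, 1, 20, max_concurrent=2)
print("uncapped (peak, finish):", uncapped)
print("capped   (peak, finish):", capped)
```

In this model, uncapped builds all overlap because each PR arrives long before the previous build finishes. Capping concurrency trades a lower peak load for a longer total run time.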
ESR are preoccupied with launching New World, understandably! Can the perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway?
The original Jenkins build was never designed to handle this much load: the underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing.
Is Jenkins the right tool for everything it’s being asked to do? No:
|
...