Date |
|
Authors | Andy Nash (Unlicensed) Joseph (Pepe) Kelly John Simmons (Deactivated) Andy Dingley Simon Meredith (Unlicensed) Paul Hoang (Unlicensed) |
Status | In progress |
Summary | On Friday evening Jenkins began struggling and then fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB also fell over between Friday and Saturday |
Impact | No Stage. No Prod. No data syncing in various places |
Timeline (collation and interpretation from across all Slack monitoring channels)
First trigger
2020-10-16 06:33 ESR Data exporter triggered a build of outstanding PRs (resulting from Dependabot)
Starting Friday and continuing over the weekend
2020-10-16 07:18 Staging RabbitMQ node 2 down (PH)
2020-10-16 07:38 Prod ES node 3 down (PH)
2020-10-16 07:38 Prod ES node 1 & 2 down - additional alert of too few nodes running - at this point, prod person search should not be working (PH)
2020-10-16 07:58 Staging ES node 2 down (PH)
2020-10-16 08:43 Phil W asks what this all means; Phil J summarises (PH)
2020-10-16 12:03 Old Concerns on green & blue Stage goes down (PH)
2020-10-16 16:38 Jenkins goes down (PH)
Then
2020-10-17 01:13 high messages in RabbitMQ Prod (PH)
2020-10-17 01:28 high messages in RabbitMQ Staging (PH)
2020-10-17 06:00 ESR n-d-l cron job didn’t start - manually kicked by Paul at 10:08, exited at 10:09
2020-10-17 06:00 ESR ETL
2020-10-17 07:08 Staging ES node 2 down, Prod RabbitMQ node 3 down (PH)
2020-10-17 07:18 Staging ES node 1 & 3 down (PH)
2020-10-17 07:33 Staging RabbitMQ node 1 & 3 down (PH)
2020-10-17 07:43 Prod Mongo goes down (PH)
…at this point ☝ pretty much everything is being affected by the combination of issues
2020-10-17 08:48 (see Prometheus graph below)
2020-10-17 09:27 (see Prometheus graph below)
2020-10-17 10:23 D/B Prod/Stage sync started but never completed
2020-10-17 10:23 NDW ETL: Stage (PaaS) failed
2020-10-18 02:42 ESR Sentry errors x 7 (reappearance of the same issue across all services)
2020-10-18 07:37 TCS ES sync job failed to run/complete on either blue or green servers
2020-10-18 10:23 NDW ETL: Stage (PaaS) failed
2020-10-18 10:25 NDW ETL: Stage (current) failed
2020-10-19 01:29 TCS ES Person sync job failed (None of the configured nodes were available)
2020-10-19 07:46 Users started reporting problems using Search on Prod
2020-10-19 07:54 (see Prometheus graph below)
2020-10-19 08:17 (see Prometheus graph below)
2020-10-19 08:59 Users' reported problems with Search on Prod had been resolved
2020-10-19 10:35 (see Prometheus graph below)
2020-10-19 massive Sentry hit on ESR, using up our entire monthly allocation
2020-10-20 07:30 Person Placement Employing Body Trust job failed to run/complete on either blue or green servers
Root Cause(s)
Dependabot does a monthly check on tooling versions and auto-generates PRs and builds which, in the case of the ESR area, create a great many concurrent containers, tripping Jenkins up (not enough RAM)
Lots and lots of downstream processes rely on Jenkins being up
We haven’t been able to configure the monthly Dependabot sweep to avoid a massive concurrent hit of PRs / builds
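One way to tame the monthly sweep is to configure Dependabot itself to open fewer PRs. The sketch below is illustrative only: the package ecosystem, directory, and repo layout are assumptions, not the actual ESR configuration.

```yaml
# .github/dependabot.yml - illustrative sketch; ecosystem and directory
# values are assumptions, not the real ESR repo layout.
version: 2
updates:
  - package-ecosystem: "gradle"
    directory: "/"
    schedule:
      interval: "monthly"
    # Cap concurrent open PRs so a monthly sweep can't flood Jenkins
    open-pull-requests-limit: 1
    ignore:
      # Skip minor/patch bumps; only raise PRs for major versions
      - dependency-name: "*"
        update-types:
          - "version-update:semver-minor"
          - "version-update:semver-patch"
```

This matches action item 4 below: at most one open Dependabot PR per project, majors only.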
Trigger
Configuration of Dependabot inadvertently puts strain on Jenkins
Dependencies of many timed jobs on Jenkins being available
Not enough configuration of retries
Resolution
Manually brought Jenkins back up
Manually restarted everything that had been affected by it being down
Detection
The good news is that there were lots of monitoring alerts in various Slack channels and, from the graphs below, probably enough evidence to have alerted us to a problem needing resolution
The bad news is that we didn’t act on the initial Friday indicators, and John was rebuilding his machine over the weekend, so couldn’t do his usual knight-in-shining-armour stuff!
Action Items
Confirmed action Items | Owner | Status | Comments |
---|---|---|---|
1. Bump up the Jenkins RAM to 32 GB (short term ONLY). Add a reminder to revisit this in 1 or 2 months. | Ops | Done | |
| ESR & others | To do | |
3. Close outstanding ESR major-version-change PRs - how many is ‘critical mass’? But without being blasé about approving PRs. Ignore minor versions. | Sachin | To do | Consider ways to reduce system resource use instead? |
4. Restrict the number of PRs Dependabot opens on each ESR project to 1 and to major versions only (given they’re microservices, we still might get significant numbers). | AndyD | To do | |
5. Use Jenkins Pipeline rather than node - to make things sequential rather than concurrent | AndyD / Pepe | To do | Pipelines not written to take advantage of extra nodes - is the time investment required greater than the benefit? |
6. Address https://hee-tis.atlassian.net/browse/TISNEW-5613 quickly (removing local stack from the Data exporter process). | ESR & perm team | To do | Spike ticket completed this Sprint. This ticket is the result of that Spike. |
7. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day. | Pepe | To do | TBC |
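Action item 5 (sequential pipeline rather than concurrent node builds) could look roughly like the declarative Jenkinsfile below. This is a sketch only: the stage names, build commands, and image tag are assumptions, not the real ESR pipeline.

```groovy
// Illustrative declarative Jenkinsfile for action item 5 - stage and
// job details are assumptions, not the actual ESR build.
pipeline {
    agent any
    options {
        // Queue new runs instead of building the same job concurrently
        disableConcurrentBuilds()
    }
    stages {
        // Sequential stages: each waits for the previous one, so a burst
        // of Dependabot PRs no longer spins up containers in parallel
        stage('Build')   { steps { sh './gradlew assemble' } }
        stage('Test')    { steps { sh './gradlew test' } }
        stage('Package') { steps { sh 'docker build -t esr-service .' } }
    }
}
```

With `disableConcurrentBuilds()`, overlapping triggers queue rather than compete for the box's 16 GB of RAM.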
Lessons Learned (Good and Bad)
Check monitoring channels in Slack, check Prometheus, check Grafana as a matter of professional pride, daily.
Act* on anything unusual (* resolve yourself if you can, alert others immediately if you think it’s serious, raise on stand ups otherwise).
The incident has encouraged us to identify the root cause, to surface some inefficiencies in related areas, and to map out a range of short- and longer-term actions to address them all.
Techy Stuff
5 whys
Everything fell over | |
---|---|
Initial discussion, along with short and longer term actions
What can we do about Dependabot creating and building simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things, so everything appears concurrent.
ESR preoccupied with launching New World, understandably! Can the perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway?
The original Jenkins build was never designed to handle this much load - the underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing.
Is Jenkins the right tool for everything it’s being asked to do? No:
Tech notes
Related to TISNEW-5454
ESR Data Exporter changed - it tried to build all containers and to fire up all the outstanding Dependabot (monthly) PRs (each PR spun up 30-odd containers).
Jenkins box only has 16 GB of RAM - can we bump up to 32? But this doubles the cost. Can we address the root cause, rather than the size of RAM?
Turn off Dependabot for now? Automatic rebasing? Not keen to switch off a valuable resource; better to address the root cause.
Modify jobs to only build if, e.g.:
PR was raised by dependabot (and is a major version change)
There are other PRs for the project building
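The "only build if" idea above needs a way to tell a major version bump from a minor one. A hypothetical guard is sketched below; the `is_major_bump` helper and the assumed Dependabot title format ("Bump <dep> from <old> to <new>") are illustrations, not an existing script.

```shell
#!/bin/sh
# Hypothetical build guard: treat a change in the first version component
# of a Dependabot PR title as a major bump. Title format is an assumption.
is_major_bump() {
  old=$(printf '%s\n' "$1" | sed -n 's/.*from \([0-9][0-9]*\)\..* to .*/\1/p')
  new=$(printf '%s\n' "$1" | sed -n 's/.* to \([0-9][0-9]*\)\..*/\1/p')
  # Build only when both versions parsed and the major components differ
  [ -n "$old" ] && [ -n "$new" ] && [ "$old" -ne "$new" ]
}

# Example: decide whether to run the build for a given PR title
if is_major_bump "Bump spring-boot from 1.5.22 to 2.3.4"; then
  echo "major bump - build"
else
  echo "minor bump - skip"
fi
```

A guard like this could run as the first step of each Jenkins job, exiting early for minor-version PRs instead of spinning up containers.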
A single Jenkins node has responsibility for control and for building everything.
We haven’t moved to GHA yet.
Person Sync (rather than Person Owner?) Job:
Prod-to-Stage sync didn’t fail on Saturday
The first problem was with the Elasticsearch sync job on Sunday
A Docker restart made Elasticsearch available to ADMINS-UI again