
Date

Authors

Andy Nash (Unlicensed) Joseph (Pepe) Kelly John Simmons (Deactivated) Andy Dingley Simon Meredith (Unlicensed)

Status

In progress

Summary

On Friday evening we saw Jenkins struggling; it then fell over, subsequently causing many other scheduled weekend jobs to fail

Impact

No Stage. No Prod. No data syncing in various places

Timeline

  • 2020-10-16 06:33 (question) ESR Data exporter triggered a build of outstanding PRs (resulting from Dependabot)

  • 2020-10-17 06:00 ESR n-d-l cron job didn’t start - manually kicked by Paul at 10:08, exited at 10:09

  • 2020-10-17 06:00 ESR ETL (question)

  • 2020-10-17 08:48 (question) (see Prometheus graph below)

  • 2020-10-17 09:27 (question) (see Prometheus graph below)

  • 2020-10-17 10:23 D/B Prod/Stage sync started but never completed

  • 2020-10-17 10:25 Jenkins service registry failure

  • 2020-10-17 10:23 NDW ETL: Stage (PaaS) failed

  • 2020-10-18 02:42 ESR Sentry errors x 7 (reappearance of the same issue across all services)

  • 2020-10-18 07:37 TCS ES sync job failed to run/complete on either blue or green servers

  • 2020-10-18 10:23 NDW ETL: Stage (PaaS) failed

  • 2020-10-18 10:25 NDW ETL: Stage (current) failed

  • 2020-10-19 01:29 TCS ES Person sync job failed (None of the configured nodes were available)

  • 2020-10-19 07:46 Users started reporting problems using Search on Prod

  • 2020-10-19 07:54 (question) (see Prometheus graph below)

  • 2020-10-19 08:17 (question) (see Prometheus graph below)

  • 2020-10-19 08:59 Problems using Search on Prod reported as resolved

  • 2020-10-19 10:35 (question) (see Prometheus graph below)

  • 2020-10-19 (question) massive Sentry hit, on ESR, using up our entire monthly allocation

  • 2020-10-20 07:30 Person Placement Employing Body Trust job failed to run/complete on either blue or green servers

Root Cause(s)

  • Dependabot does a monthly check on tooling versions and auto-generates PRs and builds; in the ESR area this creates a very large number of concurrent containers, which trips Jenkins up (not enough RAM)

  • Lots and lots of downstream processes rely on Jenkins being up

  • We haven’t been able to configure the monthly Dependabot sweep to avoid a massive concurrent hit of PRs / builds
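
By way of illustration, Dependabot's monthly sweep can be throttled per repository in its config file. A minimal sketch (the package ecosystem and labels here are assumptions, not our actual config; `open-pull-requests-limit` is the key that caps concurrent open PRs):

```yaml
# .github/dependabot.yml - hypothetical sketch for one ESR service
version: 2
updates:
  - package-ecosystem: "maven"      # assumption: ESR services build with Maven
    directory: "/"
    schedule:
      interval: "monthly"
    # Cap open PRs (and hence concurrent Jenkins PR builds) per project
    open-pull-requests-limit: 1
    labels:
      - "dependencies"              # lets Jenkins identify Dependabot PRs
```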

Trigger

  • Configuration of Dependabot to put less concurrent strain on Jenkins (question)

  • Dependencies of many timed jobs on Jenkins being available (question)

  • Not enough configuration of retries (question)

Resolution

  • Manually brought Jenkins back up (question)

  • Manually restarted everything that had been affected by it being down (question)

Detection

  • Good news: there were lots of monitoring alerts in various Slack channels, and the graphs below probably gave enough evidence to alert us to a problem needing resolution

  • Bad news: we didn’t act on the initial Friday indicators, and John was rebuilding his machine over the weekend, so couldn’t do his usual knight-in-shining-armour stuff!

Action Items

1. Bump up the Jenkins RAM to 32 GB (short term ONLY). Add a reminder to revisit this in 1 month / 2 months? (Owner: Ops)

2. Disable integration tests on ESR projects for the PR pipeline (they’d still run on merge to master, rather than on each PR). These are what fire up the local stack and test containers. (Hold back on… 'if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR'.) (Owner: ESR)

3. Close outstanding ESR PRs - how many is ‘critical mass’? But without being blasé about approving PRs. (Owner: ESR)

4. Restrict the number of PRs Dependabot opens on each ESR project to 1 (but given they’re microservices, it still might be a big number). Not much of a concern if we do 2. above. (Owner: AndyD (question))

5. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day. (Owner: Pepe (question))

Lessons Learned (Good and Bad)

Check monitoring channels in Slack, check Prometheus, check Grafana as a matter of professional pride, daily.

  • Act* on anything unusual (* resolve yourself if you can, alert others immediately if you think it’s serious, raise on stand ups otherwise).

  • The incident encouraged us to identify the root cause, to spot some inefficiencies in related areas, and to map out a range of short- and longer-term actions to address them all.


Techy Stuff

5 whys

  1. Everything fell over

  2. Jenkins

  3. ESR containers taking up all the resource

  4. Too many (Dependabot) PRs outstanding, builds, rebasing

  5. ESR did not have time to action them because of the launch of new world code

What can we do about Dependabot creating and building simultaneously?

Dependabot does run sequentially, but much faster than Jenkins can process things

We could get Dependabot to add a GitHub label to the PR - add something to the Jenkins file to read the label and mark as “Don’t run”. But this stops Dependabot being useful.
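
A gentler variant of the label idea is to skip only the container-heavy stage rather than the whole build. A hedged Jenkinsfile sketch (assumes a multibranch pipeline, where `env.CHANGE_BRANCH` is set on PR builds and Dependabot branches carry the standard `dependabot/` prefix; the Maven commands are placeholders):

```groovy
pipeline {
    agent any
    stages {
        stage('Build') {
            steps { sh './mvnw -B package -DskipTests' }   // placeholder build step
        }
        stage('Integration tests') {
            // Skip the stage that spins up the local stack and test
            // containers when the PR was raised by Dependabot.
            when {
                not { expression { env.CHANGE_BRANCH?.startsWith('dependabot/') } }
            }
            steps { sh './mvnw -B verify' }                // placeholder test step
        }
    }
}
```

This keeps Dependabot useful (unit build still runs; integration tests still run on merge to master) without the mass container spin-up per PR.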

ESR preoccupied with launching New World

Can perm team keep on top of ESR stuff when they leave?

Even when keeping on top of things, will it eventually be too much anyway? Or is it simply a case of the team not controlling the overall number of open PRs?

Original Jenkins build was never designed to handle this much load - underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing

Is Jenkins the right tool for everything it’s being asked to do? NOOOOOO:

  • 1-5. As per the Action Items above.

  • 6. move ETLs over to ECS tasks (serverless 'run container' instructions to AWS - not reliant on our infrastructure). This would remove the dependency on Jenkins - so if it went down, the jobs could continue.
    Don’t do scheduled jobs / anything with a timer - use Cron server instead for this stuff.
    Just use Jenkins as a build server

  • 7. ticket up addressing our infrastructure so that the setup ESR have created does run - it’s been done right!

  • 8. get ourselves a dedicated Jenkins server (what size (question))

  • 9. move to ElasticSearch-as-a-Service

  • Related? TISNEW-5454

  • ESR Data Exporter changed - tried to build all containers - tried to fire up all the outstanding Dependabot (monthly) PRs (each PR spun up 30-odd containers).

  • Jenkins box only has 16Gb of RAM - can we bump it up to 32? But this doubles the cost. Can we hit the root cause, rather than the size of the RAM?

  • Turn off dependabot for now? Automatic rebasing?

  • Modify jobs to build conditionally depending on, e.g.:

  • a) whether the PR was raised by Dependabot

  • b) whether other PRs for the project are already building

  • A single Jenkins node is responsible for controlling and building everything.

  • We haven’t moved to GHA yet.
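
For item 6 above (moving ETLs to scheduled ECS tasks, removing the Jenkins dependency), the scheduling side would be an EventBridge rule with a cron `--schedule-expression` plus a target like the sketch below, passed to `aws events put-targets`. All ARNs, names and subnets here are hypothetical placeholders:

```json
[
  {
    "Id": "ndw-etl-nightly",
    "Arn": "arn:aws:ecs:eu-west-2:111111111111:cluster/tools",
    "RoleArn": "arn:aws:iam::111111111111:role/ecsEventsRole",
    "EcsParameters": {
      "TaskDefinitionArn": "arn:aws:ecs:eu-west-2:111111111111:task-definition/ndw-etl:1",
      "TaskCount": 1,
      "LaunchType": "FARGATE",
      "NetworkConfiguration": {
        "awsvpcConfiguration": {
          "Subnets": ["subnet-0abc1234"],
          "AssignPublicIp": "DISABLED"
        }
      }
    }
  }
]
```

With this shape, the schedule and the container run entirely on AWS, so a Jenkins outage would no longer take the timed jobs down with it.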

Person Sync (rather than Person Owner?) Job:

  • Prod-to-Stage didn’t fail on Saturday

  • First problem with the ElasticSearch sync job appeared on Sunday

  • A Docker restart made ElasticSearch available to ADMINS-UI again

Graphs

(Prometheus graphs referenced in the Timeline were attached here.)