Date

Authors

Andy Nash (Unlicensed), Joseph (Pepe) Kelly, John Simmons (Deactivated), Andy Dingley, Simon Meredith (Unlicensed), Paul Hoang (Unlicensed)

Status

In progress

LiveDefect resolved. Actions being ticketed up (Andy Nash (Unlicensed))

Summary

On Friday evening we saw Jenkins struggle and then fall over, which caused ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB also fell over between Friday and Saturday.

Impact

No Stage. No Prod. No data syncing in various places

...

Everything fell over

Comments following catch up 2020-10-21

  1. Jenkins

  2. ESR containers taking up all the resource

  3. Too many (Dependabot) PRs outstanding, builds, rebasing

  4. ESR did not have time to action them because of the launch of the New World code

  1. Underlying OS upgrade occurred and was applied (LivePatch)

  2. Probably something to do with that very specific update (it’s never happened before, and this has been running since the Apps were built)

  3. Did Docker struggle with the patch, rather than Amazon making a mistake in applying it? (No complaints on Amazon forums, so it looks quite specific to us.)

  4. Version of Docker is linked to whatever’s available in apt

  5. Not worth changing our setup for the sake of a ‘freak’ occurrence that will probably only happen once every 3 years? Or do we switch off the OS upgrades and handle them manually every week, or switch everything off and restart it regularly?

Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue to be investigated and confirmed…

This destabilised the system, stopping Docker. When Docker restarted, everything started coming back up again.

Initial discussion, along with short and longer term actions

What can we do about Dependabot creating and building simultaneously?

Dependabot does run sequentially, but much faster than Jenkins can process things so everything appears concurrent.
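The backlog effect described above can be illustrated with a little arithmetic. This is only a sketch: it assumes a single build executor, and the PR interval and build duration used in the example are made-up numbers, not measurements from our Jenkins.

```python
def builds_outstanding(pr_count: int, open_interval_s: float, build_duration_s: float) -> int:
    """Count builds still queued or running at the moment the last PR is opened.

    Assumes one executor that works through builds in arrival order
    (an assumption for illustration, not our actual Jenkins configuration).
    """
    last_open = (pr_count - 1) * open_interval_s
    finish = 0.0   # time at which the executor finishes its current build
    finished = 0   # builds completed by the time the last PR is opened
    for i in range(pr_count):
        arrival = i * open_interval_s
        # A build starts when it arrives or when the executor frees up, whichever is later.
        finish = max(arrival, finish) + build_duration_s
        if finish <= last_open:
            finished += 1
    return pr_count - finished


# Hypothetical numbers: a PR every 30 s, each build taking 10 minutes.
print(builds_outstanding(pr_count=10, open_interval_s=30, build_duration_s=600))
```

With PRs opened faster than a build completes, every build is still outstanding when the last PR arrives, so a strictly sequential source looks concurrent from the build server's side.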

We could get Dependabot to add a GitHub label to the PR, then add something to the Jenkinsfile to read the label and mark the build as “Don’t run”. But this stops Dependabot being useful.
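A minimal sketch of the label check, written in Python rather than as Jenkinsfile code. The label name `skip-ci` and both function names are hypothetical. The only thing taken from GitHub is that its pull request JSON carries a `labels` list of objects with a `name` field.

```python
from typing import Any, Dict, Iterable, List

SKIP_LABEL = "skip-ci"  # hypothetical label name, not one we have agreed on


def labels_from_pull_request(pr: Dict[str, Any]) -> List[str]:
    """Extract label names from a GitHub pull request JSON object."""
    return [label["name"] for label in pr.get("labels", [])]


def should_build(pr_labels: Iterable[str], skip_label: str = SKIP_LABEL) -> bool:
    """Return False when the PR carries the skip label (case-insensitive)."""
    return skip_label.lower() not in {name.lower() for name in pr_labels}


# Example payload fragment, trimmed to the fields used here.
pr = {"number": 0, "labels": [{"name": "dependencies"}, {"name": "skip-ci"}]}
print(should_build(labels_from_pull_request(pr)))
```

The trade-off noted above still applies: a PR that never builds tells us nothing about whether the dependency bump is safe.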

ESR preoccupied with launching New World, understandably!

Can perm team keep on top of ESR stuff when they leave?

Even when keeping on top of things, will it eventually be too much anyway?
Or is it simply a case of the team not controlling the overall number of open PRs?

The original Jenkins build was never designed to handle this much load - the underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing.

Is Jenkins the right tool for everything it’s being asked to do? No:

  1. bump up the Jenkins RAM to 32 GB (short term ONLY). Add a reminder to revisit this in 1 month / 2 months?

  2. disable integration tests on ESR projects for PR pipeline (they’d still run on merge to master, rather than each PR). These are what fire up the local stack and test containers (hold back on…'if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR')

  3. Close outstanding ESR PRs - how many is ‘critical mass’? But without being blasé about approving PRs

  4. Restrict the number of PRs Dependabot opens on each ESR project to 1 (but given they’re microservices, it still might be a big number). Not much of a concern if we do 2. above.

  5. The Elasticsearch nightly sync shouldn’t be necessary. Verify that Elasticsearch is being updated properly during the day.

  6. move ETLs over to ECS tasks (serverless ‘run container’ instructions to AWS - not reliant on our infrastructure).
    This would remove the dependency on Jenkins - so if it went down, the jobs could continue.
    Don’t run scheduled jobs / anything with a timer on Jenkins - use a cron server instead for this stuff.
    Just use Jenkins as a build server (Metabase also runs on Jenkins, but doesn’t use much)

  7. ticket up addressing our infrastructure so that the set up ESR have created does run - it’s been done right!

  8. get ourselves a dedicated Jenkins server (what size?)

  9. move to Elasticsearch SaaS
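For item 6, a rough sketch of what a ‘run container’ instruction to ECS could look like from Python with boto3. The cluster, task definition, subnet and container names are placeholders, and the Fargate launch type is an assumption based on the ‘serverless’ wording. Nothing here reflects an existing setup.

```python
from typing import Any, Dict, List, Optional


def build_run_task_request(
    cluster: str,
    task_definition: str,
    subnets: List[str],
    container_name: str,
    command: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 ECS client's run_task call."""
    request: Dict[str, Any] = {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",  # assumption: serverless, no hosts of our own to manage
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "DISABLED"}
        },
    }
    if command:
        # Override the container's command for this one run only.
        request["overrides"] = {
            "containerOverrides": [{"name": container_name, "command": command}]
        }
    return request


def run_etl(request: Dict[str, Any]) -> Dict[str, Any]:
    """Submit the task to ECS. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the request builder works without it

    return boto3.client("ecs").run_task(**request)


# Placeholder names throughout; nothing below refers to real infrastructure.
request = build_run_task_request(
    cluster="example-cluster",
    task_definition="example-etl",
    subnets=["subnet-placeholder"],
    container_name="etl",
    command=["python", "sync.py"],
)
print(request["launchType"])
```

Whatever triggers this (a cron server, as suggested above, or an AWS-side schedule) only needs to make the one call, so the ETL no longer depends on Jenkins being up.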

...