
Date

Authors

Andy Nash (Unlicensed) Joseph (Pepe) Kelly John Simmons (Deactivated) Andy Dingley Simon Meredith (Unlicensed) Paul Hoang (Unlicensed) Sachin Mehta (Unlicensed)

Status

LiveDefect resolved. Actions being ticketed up (Andy Nash (Unlicensed))

Summary

On Friday evening we noticed that the server used to build our applications and run scheduled jobs was struggling. It then crashed, subsequently causing ETL (scheduled data flows between systems) and data-related issues. The systems used to search, transfer and display data in TIS, along with the data stores, also froze between Friday and Saturday.

Impact

Users unable to search for trainees.

Staging environment was not updated with production data on Saturday as scheduled

No data synchronisation between systems

Timeline

0600 Friday 16th October 2020

We have a system to automatically upgrade versions of software, and keep these versions and dependencies between systems synchronised. This ensures software is patched and up to date. A large number of these upgrades were triggered at one time which caused our build server to build the application multiple times in order to run the automated tests against them prior to deployment.

0630 Friday 16th October 2020

One of our search servers failed

0730 Friday 16th October 2020

More search server failures

Stage environment suffers various server failures

1630 Friday 16th October

Build server fails

0110 Saturday 17th October

High number of messages in ESR system waiting to be processed

0600 Saturday 17th October

ESR scheduled job process failed to start

0800 Saturday 17th October

Multiple server failures by this point and TIS search non-operational

Root Causes

  • An update to the operating system was automatically applied without any need to restart servers

  • This caused a conflict of versions with our containerisation platform that we use to host our applications. This is the first time we have seen this since TIS went live so we need to assess if this is a one-off occurrence

  • The trainee data used for our search function was unavailable (we use elasticsearch but this was unreachable)

  • The version conflict caused a domino effect which resulted in multiple server failures. This was compounded by our automated dependency system (Dependabot), which created multiple builds of the application at once and caused the build server to fail (not enough RAM)

  • Many downstream processes rely on the build server

  • We haven’t configured the monthly Dependabot sweep to stagger the automated builds

Trigger

  • 6am patch caused a version conflict which required a restart of the platform running our applications (Docker)

  • Failure of Docker to restart put pressure on the build server as well as many other servers

  • ElasticSearch issues compromised the search function for users of the app

  • Dependencies of many timed jobs on the build server being available

Resolution

  • Restarted build server

  • Restarted National Data Warehouse ETLs on Stage and Prod environments

  • Restarted Docker on search server

  • Restarted Neo4J GraphDB

  • Turned on Change Data Capture system to deal with backlog of data changes from TIS to ESR

  • Removed all the old images from MongoDB, one of our databases

Detection

  • The good news is that there were lots of monitoring alerts in various Slack channels and, from the graphs below, probably enough evidence to have alerted us to a problem needing resolution

  • Users pointed out the issue with search not functioning correctly on TIS

  • The bad news is that we didn’t act on the initial Friday indicators (they didn’t appear as serious as they turned out to be)

Actions

  • Look into being able to do a full restart automatically if a similar situation arises

  • Increase the memory (RAM) on the build server

  • Implement and close any remaining patches to ESR which were automatically created

  • Restrict the number of dependencies that can be upgraded at one time to reduce the load on the build server

  • Make changes to build server to move from concurrent to sequential builds

  • Look at removing the nightly job that refreshes search data as this may no longer be necessary

Lessons Learned (Good and Bad)

  • Dev team to check the monitoring channels more regularly

  • Act on anything unusual (resolve immediately if easy to do so, alert others immediately if it appears serious, raise on stand ups otherwise).

  • The incident has encouraged us to do the most thorough root cause analysis we have ever done, to identify some inefficiencies in related areas, and to map out a range of short- and longer-term actions to address them all.

...

Technical Detail

Summary

On Friday evening we saw Jenkins struggling; it then fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB also fell over between Friday and Saturday

Timeline (collation and interpretation from across all Slack monitoring channels)

First trigger

  • 2020-10-16 (Friday) c. 6am Patch comes out for the kernel

  • 2020-10-16 (Friday) 06:33 ESR Data exporter triggered a build of outstanding PRs (resulting from Dependabot - upgrade at Git repo level)

  • 2020-10-16 (Friday) 06:36 One ES cluster node failed

  • 2020-10-16 (Friday) 06:42 amazon-ssm-agent kicked off an apt upgrade (at OS level), which included xxd, python 2.7, vim, python 2.6, linux AWS 5.4 headers

  • 2020-10-16 (Friday) 06:?? Old versions of xxd, python 2.7, vim, python 2.6, linux AWS 5.4 headers were removed - the upgrades were auto-applied without a restart via LivePatch (commands to check what an automatic upgrade run changed are sketched below)
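
For future reference, a minimal sketch of the checks that would confirm what one of these automatic upgrade runs changed and whether Docker survived it - assuming Ubuntu hosts with Canonical Livepatch and the default apt log location:

  canonical-livepatch status --verbose                            # which kernel patches Livepatch has applied
  grep -A6 'Start-Date' /var/log/apt/history.log | tail -n 40     # what the recent apt runs installed/removed
  systemctl status docker --no-pager                              # is the Docker daemon still healthy?
  docker info --format '{{.ServerVersion}}'                       # can the CLI still reach the daemon?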

Starting Friday and continuing over the weekend

  • 2020-10-16 (Friday) 07:18 Staging RabbitMQ node 2 down (PH)

  • 2020-10-16 (Friday) 07:38 Prod ES node 3 down (PH)

  • 2020-10-16 (Friday) 07:38 Prod ES node 1 & 2 down - additional alert of too few nodes running - at this point, prod person search should not be working (because of sharding, users may have been seeing only partial results at this point) (PH) 👈 need to look into and confirm this assumption… (a quick cluster health check is sketched below)

  • 2020-10-16 (Friday) 07:58 Staging ES node 2 down (PH)

  • 2020-10-16 (Friday) 08:43 Phil W asks what this all means, Phil J summarises (PH)

  • 2020-10-16 (Friday) 12:03 Old concerns on green&blue stage goes down (PH)

  • 2020-10-16 (Friday) 16:38 Jenkins goes down (PH)
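
To confirm (or rule out) the partial-results assumption above, a cluster health check along these lines should be enough - the host and port are placeholders for the TIS Elasticsearch endpoint:

  curl -s 'http://es-host:9200/_cluster/health?pretty'            # overall status: green / yellow / red
  curl -s 'http://es-host:9200/_cat/nodes?v'                      # which nodes are actually in the cluster
  curl -s 'http://es-host:9200/_cat/shards?v' | grep -v STARTED   # any shard not STARTED is missing or relocating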

Then

  • 2020-10-17 (Saturday) c. 6am Patch comes out for the kernel

  • 2020-10-17 (Saturday) 01:13 High messages in RabbitMQ Prod (PH)

  • 2020-10-17 (Saturday) 01:28 High messages in RabbitMQ Staging (PH)

  • 2020-10-17 (Saturday) 06:00 ESR n-d-l cron job didn’t start - manually kicked by Paul at 10:08, exited at 10:09

  • 2020-10-17 (Saturday) 06:00 ESR ETL (question)

  • 2020-10-17 (Saturday) 06:02 ES Docker container failure (see Sachin’s snippet on Slack channel)

  • 2020-10-17 (Saturday) 07:08 Staging ES node 2 down, Prod RabbitMQ node 3 down (PH)

  • 2020-10-17 (Saturday) 07:18 Staging ES node 1 & 3 down (PH)

  • 2020-10-17 (Saturday) 07:33 Staging RabbitMQ node 1 & 3 down (PH)

  • 2020-10-17 (Saturday) 07:43 Prod Mongo goes down (PH)

…at this point ☝ pretty much everything is being affected by the combination of issues

👇 shall I remove this section of the timeline completely from this incident log? Seems overkill, given pretty much everything was down at this point

  • 2020-10-17 (Saturday) 08:48 (question) (see Prometheus graph below)

  • 2020-10-17 (Saturday) 09:27 (question) (see Prometheus graph below)

  • 2020-10-17 (Saturday) 10:23 D/B Prod/Stage sync started but never completed

  • 2020-10-17 (Saturday) 10:23 NDW ETL: Stage (PaaS) failed

  • 2020-10-18 (Sunday) 02:42 ESR Sentry errors x 7 (reappearance of the same issue across all services)

  • 2020-10-18 (Sunday) 07:37 TCS ES sync job failed to run/complete on either blue or green servers

  • 2020-10-18 (Sunday) 10:23 NDW ETL: Stage (PaaS) failed

  • 2020-10-18 (Sunday) 10:25 NDW ETL: Stage (current) failed

  • 2020-10-18 (Sunday) 07:37 TCS ES sync job failed to run/complete on either blue or green servers

  • 2020-10-19 (Monday) 01:29 TCS ES Person sync job failed (None of the configured nodes were available)

  • 2020-10-19 (Monday) 07:46 Users started reporting problems using Search on Prod

  • 2020-10-19 (Monday) 08:59

  • 2020-10-19 (Monday) 07:54 (question) (see Prometheus graph below)

  • 2020-10-19 (Monday) 08:17 (question) (see Prometheus graph below)

  • 2020-10-19 (Monday) 10:35 (question) (see Prometheus graph below)

  • 2020-10-17 (question)

  • 2020-10-20 (Tuesday) 07:30 Person Placement Employing Body Trust job failed to run/complete on either blue or green servers

...

4. Restrict the number of PRs Dependabot opens on each ESR project to 1 (but given they’re microservices, it still might be a big number). Not much of a concern if we do 2. above.
There are dependencies between PRs. So restricting to one might have a chain reaction. No getting round that things like this will need manual intervention.
Restrict to only major version changes. Restricting the number of major dependencies might cause issues. So let’s see how restricting to major versions affects things and decide on further action if needed at that point

...

AndyD

...

AndyD / Pepe

...

Pipelines not written to take advantage of extra nodes - time investment greater than benefit.

...

6. Address https://hee-tis.atlassian.net/browse/TISNEW-5613 quickly (removing local stack from the Data exporter process). Needs refining and sub-tasking with ESR and perm team, before ESR disappear.

...

ESR & perm team

...

7. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day.

...

Pepe (question)

Lessons Learned (Good and Bad)

  • Check monitoring channels in Slack, check Prometheus, check Grafana as a matter of professional pride, daily.

  • Act* on anything unusual (* resolve yourself if you can, alert others immediately if you think it’s serious, raise on stand ups otherwise).

  • Incident has encouraged us to identify the root cause, to identify some inefficiencies in related areas, and to map out a range of short- and longer-term actions to address them all.

Techy Stuff

5 whys

Everything fell over

  1. Jenkins

  2. ESR containers taking up all the resource

  3. Too many (Dependabot) PRs outstanding, builds, rebasing

  4. ESR did not have time to action them because of the launch of new world code

  • what else (double check VMs, Logs, etc)?

  • Discussion
    • Users reporting problems using Search on Prod had been resolved

    ...

    2020-10-19 07:54 (question) (see Prometheus graph below)

    ...

    2020-10-19 08:17 (question) (see Prometheus graph below)

    ...

    2020-10-19 10:35 (question) (see Prometheus graph below)

    ...

    2020-10-19 (question) massive Sentry hit, on ESR, using up our entire monthly allocation

    ...

    2020-10-20 07:30 Person Placement Employing Body Trust job failed to run/complete on either blue or green servers

    Prime Timeline

    • 2020-10-16 (Friday) 07:18 Staging RabbitMQ node 2 down

    • 2020-10-16 (Friday) 07:38 Prod ES node 3 down

    • 2020-10-16 (Friday) 07:38 Prod ES node 1 & 2 down - additional alert of too few nodes running - at this point, prod person search should not be working

    • 2020-10-16 (Friday) 07:58 Staging ES node 2 down

    • 2020-10-16 (Friday) 08:43 Phil W asks what this all means, Phil J summarises

    • 2020-10-16 (Friday) 12:03 Old concerns on green&blue stage goes down

    • 2020-10-16 (Friday) 16:38 Jenkins goes down

    • Same alerts continue over the weekend and ETL failures occur because ES is down

    • 2020-10-17 (Saturday) 01:13 high messages in RabbitMQ Prod

    • 2020-10-17 (Saturday) 01:28 high messages in RabbitMQ Staging

    • 2020-10-17 (Saturday) 07:08 Staging ES node 2 down, Prod RabbitMQ node 3 down

    • 2020-10-17 (Saturday) 07:18 Staging ES node 1 & 3 down

    • 2020-10-17 (Saturday) 07:33 Staging RabbitMQ node 1 & 3 down

    • 2020-10-17 (Saturday) 07:43 Prod Mongo goes down

    • can’t be bothered to go through any more alerts, everything is broken at this point

    Root Cause(s)

    • Dependabot does a monthly check on tooling versions and auto-generates PRs and builds which, in the case of the ESR area, creates many many concurrent containers which trips Jenkins up (not enough RAM)

    • Lots and lots of downstream processes rely on Jenkins being up

    • We haven’t been able to configure the monthly Dependabot sweep to avoid a massive concurrent hit of PRs / builds

    Trigger

    • Configuration of Dependabot to put less concurrent strain on Jenkins (question)

    • Dependencies of many timed jobs on Jenkins being available (question)

    • Not enough configuration of retries (question)

    Resolution

    • Manually brought Jenkins back up (question)

    • Manually restarted everything that had been affected by it being down (question)

    Detection

    • The good news is that there were lots of monitoring alerts in various Slack channels and, from the graphs below, probably enough evidence to have alerted us to a problem needing resolution

    • Bad news is we didn’t act on the initial Friday indicators, and John was rebuilding his machine over the weekend, so couldn’t do his normal knight in shiny armour stuff!

    Action Items

    ...

    Action Items

    ...

    Owner

    ...

    Status

    ...

    Comments

    ...

    1. bump up the Jenkins RAM to 32Gb (short term ONLY). Add a reminder to revisit this in 1 month / 2 months?

    ...

    Ops

    ...

    Done

    ...

    2. disable integration tests on ESR projects for PR pipeline (they’d still run on merge to master, rather than each PR). These are what fire up the local stack and test containers (hold back on…'if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR')
    ESR guys not keen at all on this. Risk of breaking the pipeline and blocking anything thereafter. When you need it the most…
    Alternative: code in AWS S3 creds to save firing up the local stack. Need to be careful how to set this up. Shared bucket for integration tests across Reval, TISSS as well. With archiving rules deleting everything >5 days old.
    See ‘diet’ ticket below - lots of work, but an investment that will have its uses across all products within the TIS portfolio

    ...

    ESR & friends

    ...

    To do

    ...

    3. Close outstanding ESR PRs - how many is ‘critical mass’? But without being blasé about approving PRs. Fix major versions only. Ignore minor versions.

    ...

    ESR / Sachin (question)

    ...

    To do

    ...

    Consider ways to reduce system resources instead?
    Discussed with Doris and ESR team last week…

    Root Cause(s)

    • OS apt patch running every morning used the LivePatch function to apply the patch without needing to restart everything.

    • This caused a conflict with Docker. Note, however, that this conflict had not been seen before in TIS’s lifetime, so the working theory is that it was a one-off and is not likely to recur.

    • The index of trainees for the search page (elasticsearch) was unreachable and couldn’t start up.

    • The conflict then compromised everything else in a domino effect, compounded by the backed-up Dependabot PRs and builds which, in the case of the ESR area, created multiple concurrent containers and tripped Jenkins up (not enough RAM)

    • Many downstream processes rely on Jenkins being up

    • We haven’t configured the monthly Dependabot sweep to stagger the hit of PRs / builds

    Trigger

    • 6am OS patch process auto-applied the patches via LivePatch, causing a conflict with Docker which needed to be restarted

    • Failure of Docker restarting following the OS patch puts strain on ElasticSearch and Jenkins (and everything else)

    • ElasticSearch issues compromised the search function for users of the app

    • Dependencies of many timed jobs on Jenkins being available

    • Not enough configuration of retries (question)

    Resolution

    • Restarted Jenkins

    • Restarted NDW ETLs on Stage (current and PaaS)

    • Restarted Docker on ES nodes

    • Restarted Neo4J container (rather than the service)

    • Spun up CDC (Paul) - looking at the HUGE backlog of messages from changes to TIS data (which started coming down rapidly. Phew!)

    • Removed all the dangling Docker volumes from the Mongo host (several times) - old images that were taking up space (see the clean-up commands sketched below)
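
    For reference, a sketch of the clean-up described in the last point - the prune commands only touch dangling/unused resources, and the seven-day image filter is an assumption rather than a record of what was actually run:

      docker volume ls -qf dangling=true              # list the dangling volumes first
      docker volume prune -f                          # remove them
      docker image prune -af --filter 'until=168h'    # remove unused images older than 7 days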

    Detection

    • The good news is that there were lots of monitoring alerts in various Slack channels and, from the graphs below, probably enough evidence to have alerted us to a problem needing resolution

    • Users pointed out the issue with search not functioning correctly on TIS

    • The bad news is that we didn’t act on the initial Friday indicators (they didn’t appear as serious as they turned out to be)

    Action Items

    Confirmed action Items

    Owner

    Status

    Comments

    0. Spike the creation of an Ansible config to do a full restart (John Simmons (Deactivated) to complete the description of this) - in case we come across this situation again (a minimal playbook sketch follows this table).

    Ops

    To do

    1. bump up the Jenkins RAM to 32Gb (short term ONLY). Add a reminder to revisit this in 1 month / 2 months?

    Ops

    Done

    2. Disable integration tests on ESR projects for the PR pipeline (they’d still run on merge to master, rather than on each PR). These are what fire up the local stack and test containers (hold back on… 'if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR').
    ESR guys not keen at all on this. Risk of breaking the pipeline and blocking anything thereafter. When you need them the most… don’t have tests switched off!
    Alternative: code in AWS S3 creds to save firing up the local stack. Need to be careful setting this up. Shared bucket for integration tests across Reval, TISSS as well, with archiving rules deleting everything >5 days old (a lifecycle-rule sketch follows this table).
    See ‘diet’ ticket below - lots of work, but an investment that will have its uses across all products within the TIS portfolio

    ESR & others

    To do

    3. Close outstanding ESR major version change PRs - how many is ‘critical mass’? But without being blasé about approving PRs. Ignore minor versions.

    Sachin (question)

    To do

    Consider ways to reduce system resources instead?
    Discussed with Doris and ESR team last week…

    4. Restrict the number of PRs Dependabot opens on each ESR project to 1 and to major versions only (given they’re microservices, we still might get significant numbers).
    There are dependencies between PRs. So restricting to one might have a chain reaction. No getting round that things like this will need manual intervention.
    So let’s see how restricting to major versions affects things and decide on further action if needed at that point.

    AndyD

    To do

    5. Use a Jenkins pipeline rather than a node - to make builds sequential rather than concurrent, via the disableConcurrentBuilds option: disallow concurrent executions of the pipeline; useful for preventing simultaneous access to shared resources. For example: options { disableConcurrentBuilds() }

    AndyD / Pepe

    To do

    Pipelines not written to take advantage of extra nodes - is the time investment required greater than the benefit (question)

    6. Address https://hee-tis.atlassian.net/browse/TISNEW-5613 quickly (removing local stack from the Data exporter process).
    Needs refining and sub-tasking with ESR and perm team, before ESR disappear.

    ESR & perm team

    To do

    Spike ticket completed this Sprint. This ticket is the result of that Spike.

    7. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day. (question)

    Pepe (question)

    To do

    TBC
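
    A minimal sketch of the action 0 idea, pending the spike - ad-hoc Ansible commands that restart Docker (and Jenkins) across the estate so everything comes back up clean. The inventory file and group names are placeholders; the spike would turn this into a proper playbook with an agreed host list:

      ansible -i inventory tis_docker_hosts -b -m ansible.builtin.service -a 'name=docker state=restarted'
      ansible -i inventory build_server -b -m ansible.builtin.service -a 'name=jenkins state=restarted'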
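
    And a sketch of the S3 alternative in action 2 - a shared integration-test bucket with a lifecycle rule that expires objects after 5 days. The bucket name is a placeholder:

      aws s3api put-bucket-lifecycle-configuration \
        --bucket tis-integration-test-data \
        --lifecycle-configuration '{
          "Rules": [
            {
              "ID": "expire-test-objects",
              "Status": "Enabled",
              "Filter": {"Prefix": ""},
              "Expiration": {"Days": 5}
            }
          ]
        }'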

    Lessons Learned (Good and Bad)

    • Check monitoring channels in Slack, check Prometheus, check Grafana as a matter of professional pride, daily.

    • Act* on anything unusual (* resolve yourself if you can, alert others immediately if you think it’s serious, raise on stand ups otherwise).

    • Incident has encouraged us to do the most thorough root cause analysis we have ever done, to identify some inefficiencies in related areas, and to map out a range of short- and longer-term actions to address them all.

    ...

    Techy Stuff

    5 whys

    Everything fell over

    Comments following catch up 2020-10-21

    1. Jenkins

    2. ESR containers taking up all the resource

    3. Too many (Dependabot) PRs outstanding, builds, rebasing

    4. ESR did not have time to action them because of the launch of new world code

    1. Underlying OS upgrade occurred and was applied (LivePatch)

    2. Probably something to do with that very specific update (it’s never happened before and has been running since the Apps were built)

    3. Did Docker struggle with the patch, rather than Amazon making a mistake applying it? (No complaints on Amazon forums, so it looks quite specific to us)

    4. Version of Docker is linked to whatever’s available in apt

    5. Not worth changing our set up for the sake of a ‘freak’ occurrence that probably will only happen once every 3 years? Or scheduling the OS upgrades to a better time / day, so we can actively check that they’ve done what they were expected to do, and not caused conflicts. We can then determine whether a full restart immediately afterwards would be sensible too.

    Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue that triggered everything.

    It destabilised the system, stopping Docker. When Docker was manually restarted on Monday morning, everything started coming back up again. And stability was restored.

    Initial discussion, along with short and longer term actions

    What can we do about Dependabot creating and building simultaneously?

    Dependabot does run sequentially, but much faster than Jenkins can process things so everything appears concurrent.

    We could get Dependabot to add a GitHub label to the PR - add something to the Jenkins file to read the label and mark as “Don’t run”.
    But this stops Dependabot being useful.
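
    An alternative to labels, per action 4 above, is to throttle Dependabot itself in each repo's .github/dependabot.yml - a sketch only; the package ecosystem is a placeholder and varies per service:

# Sketch: .github/dependabot.yml for one ESR repo
cat > .github/dependabot.yml <<'EOF'
version: 2
updates:
  - package-ecosystem: "gradle"          # placeholder; set per project
    directory: "/"
    schedule:
      interval: "monthly"
    open-pull-requests-limit: 1          # at most one open Dependabot PR at a time
    ignore:
      - dependency-name: "*"
        update-types:
          - "version-update:semver-minor"   # only raise PRs for major version bumps
          - "version-update:semver-patch"
EOF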

    ESR preoccupied with launching New World, understandably!

    Can perm team keep on top of ESR stuff when they leave?

    Even when keeping on top of things, will it eventually be too much anyway?
    Or is it simply a case of the team not controlling the overall number of open PRs?

    Original Jenkins build was never designed to handle this much load - underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing

    Is Jenkins the right tool for everything it’s being asked to do? No:

    1. bump up the Jenkins RAM to 32Gb (short term ONLY). Add a reminder to revisit this in 1 month / 2 months?

    2. disable integration tests on ESR projects for PR pipeline (they’d still run on merge to master, rather than each PR). These are what fire up the local stack and test containers (hold back on…'if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR')

    3. Close outstanding ESR PRs - how many is ‘critical mass’? But without being blasé about approving PRs

    4. Restrict the number of PRs Dependabot opens on each ESR project to 1 (but given they’re microservices, it still might be a big number). Not much of a concern if we do 2. above.

    5. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day.

    6. move ETLs over to ECS tasks (serverless ‘run container’ instructions to AWS - not reliant on our infrastructure; a run-task sketch follows this list).
      This would remove the dependency on Jenkins - so if it went down, the jobs could continue. N.B. This doesn’t apply to the ElasticSearch job.
      Don’t do scheduled jobs / anything with a timer - use a cron server instead for this stuff.
      Just use Jenkins as a build server (Metabase also runs on Jenkins, but doesn’t use much)

    7. ticket up addressing our infrastructure so that the set up ESR have created does run - it’s been done right!

    8. get ourselves a dedicated Jenkins server (what size (question))

    9. move to ElasticSearch-aa-S
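
    For item 6, a sketch of what those ‘run container’ instructions could look like for one ETL - the cluster name, task definition and network IDs are placeholders:

      aws ecs run-task \
        --cluster tis-etl \
        --launch-type FARGATE \
        --task-definition ndw-etl:1 \
        --count 1 \
        --network-configuration 'awsvpcConfiguration={subnets=[subnet-0abc],securityGroups=[sg-0abc],assignPublicIp=DISABLED}'

      # Scheduling without Jenkins: an EventBridge rule can fire the same task definition on a cron
      aws events put-rule --name ndw-etl-nightly --schedule-expression 'cron(0 1 * * ? *)'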

    Tech notes

    • Related? TISNEW-5454

    • ESR Data Exporter changed - tried to build all containers - tried to fire up all the outstanding Dependabot (monthly) PRs (each PR spun up 30 odd containers).

    • Jenkins box only has 16GB of RAM - can we bump it up to 32? But this doubles the cost. Can we hit the root cause, rather than just increasing the RAM?

    • Turn off dependabot for now? Automatic rebasing?

    • Modify jobs to only build if, e.g.:

    • a) PR was raised by dependabot

    • b) There are other PRs for the project building

    • Single Jenkins node responsibility for control and building everything.

    • We haven’t moved to GHA yet.

    Person Sync (rather than Person Owner?) Job:

    • Prod-to-Stage didn’t fail on Saturday

    • First a problem with Elastic Search Sync job on Sunday

    • Docker restart made elastic-search available to ADMINS-UI again

    Graphs

    ...

    1. move to ElasticSearch SaaS

    Checklist of actions we should take in future

    • One-off task: Look at the way updates are applied - AWS Systems Manager - tell it when you want to do the patching. Schedule to patch Stage first, check, and then apply to Prod. Then set up a simple Slack reminder to check all is well (a maintenance-window sketch follows this list)

    •  Manage user expectations - alert them AS SOON AS we know something’s amiss. AND keep updates flowing as we resolve things. UNDERSTAND which of our remedial actions will have what effect on users. UPDATE the status alert on TIS (ensure it’s not just Phil KOTN who has access / does this)
    •  When an ES node fails, generate an auto-restart (question)
    •  Check logs
    •  Platform-wide restart - ensure everything is brought back up ‘clean’
    •  Recheck
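
    A sketch of the Systems Manager scheduling idea from the first point - separate maintenance windows so Stage is patched (and checked) before Prod. Window names and cron expressions are illustrative only:

      aws ssm create-maintenance-window \
        --name tis-stage-patching \
        --schedule 'cron(0 6 ? * TUE *)' \
        --duration 2 --cutoff 1 \
        --allow-unassociated-targets

      aws ssm create-maintenance-window \
        --name tis-prod-patching \
        --schedule 'cron(0 6 ? * THU *)' \
        --duration 2 --cutoff 1 \
        --allow-unassociated-targets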

    Graphs

    ...