Date | |
Authors | Andy Nash (Unlicensed), Joseph (Pepe) Kelly, John Simmons (Deactivated), Andy Dingley, Simon Meredith (Unlicensed), Paul Hoang (Unlicensed) |
Status | Live defect resolved. Actions being ticketed up (Andy Nash (Unlicensed)) |
Summary | On Friday evening we saw Jenkins struggling; it then fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB also fell over between Friday and Saturday |
Impact | No Stage. No Prod. No data syncing in various places |
...
Timeline (collation and interpretation from across all Slack monitoring channels)
First trigger
2020-10-16 ~06:00 Patch comes out for the kernel
2020-10-16 06:33 ESR Data exporter triggered a build of outstanding PRs (resulting from Dependabot - upgrade at Git repo level)
2020-10-16 06:36 one ES cluster node failed
2020-10-16 06:42 amazon-ssm-agent kicked off an “apt” upgrade (at OS level): xxd, python 2.7, vim, python 2.6, linux AWS 5.4 headers, etc.
2020-10-16 06:?? Old versions of xxd, python 2.7, vim, python 2.6, linux AWS 5.4 headers, etc. were removed - so either something restarted the server automatically, or the upgrades were applied without a restart (LivePatch)
...
2020-10-17 08:48 (see Prometheus graph below)
2020-10-17 09:27 (see Prometheus graph below)
2020-10-17 10:23 D/B Prod/Stage sync started but never completed
2020-10-17 10:23 NDW ETL: Stage (PaaS) failed
2020-10-18 02:42 ESR Sentry errors x 7 (reappearance of the same issue across all services)
2020-10-18 07:37 TCS ES sync job failed to run/complete on either blue or green servers
2020-10-18 10:23 NDW ETL: Stage (PaaS) failed
2020-10-18 10:25 NDW ETL: Stage (current) failed
2020-10-19 01:29 TCS ES Person sync job failed (None of the configured nodes were available)
2020-10-19 07:46 Users started reporting problems using Search on Prod
2020-10-19 07:54 (see Prometheus graph below)
2020-10-19 08:17 (see Prometheus graph below)
2020-10-19 08:59 The problems users reported with Search on Prod had been resolved
2020-10-19 10:35 (see Prometheus graph below)
2020-10-19 Massive Sentry hit on ESR, using up our entire monthly allocation
2020-10-20 07:30 Person Placement Employing Body Trust job failed to run/complete on either blue or green servers
Root Cause(s)
...
Dependabot does a monthly check on tooling versions and auto-generates PRs and builds which, in the case of the ESR area, create many concurrent containers that trip Jenkins up (not enough RAM)
Many downstream processes rely on Jenkins being up
We haven’t been able to configure the monthly Dependabot sweep to avoid a massive concurrent hit of PRs / builds
Trigger
Configuration of Dependabot inadvertently puts strain on Jenkins
Dependencies of many timed jobs on Jenkins being available
Not enough configuration of retries
Resolution
Manually brought Jenkins back up
Manually restarted everything that had been affected by it being down
Detection
Good news: there were lots of monitoring alerts in various Slack channels and, from the graphs below, probably enough evidence to have alerted us to a problem needing resolution
Bad news: we didn’t act on the initial Friday indicators, and John was rebuilding his machine over the weekend, so couldn’t do his usual knight-in-shining-armour stuff!
Action Items
Confirmed action Items | Owner | Status | Comments |
---|---|---|---|
0. Spike the creation of an Ansible config to do a full restart (John Simmons (Deactivated) to complete the description of this) - in case we come across this situation again. Can we put together a checklist of the actions taken to get us up and running again? Ideally this checklist could be a reference for future failures. | Ops | To do | |
1. Bump up the Jenkins RAM to 32Gb (short term ONLY). Add a reminder to revisit this in 1 month / 2 months? | Ops | Done | |
2. Disable integration tests on ESR projects for the PR pipeline (they’d still run on merge to master, rather than on each PR). These are what fire up the local stack and test containers. (Hold back on ‘if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR’.) Alternative: code in AWS S3 creds to save firing up local stack - needs careful set-up, with a shared bucket for integration tests across Reval and TISSS as well, and archiving rules deleting everything >5 days old. | ESR & others | To do | ESR team not keen at all on this: risk of breaking the pipeline and blocking anything thereafter. When you need them the most… don’t have tests switched off! See ‘diet’ ticket below - lots of work, but an investment that will have its uses across all products within the TIS portfolio. |
3. Close outstanding ESR major version change PRs - how many is ‘critical mass’? But without being blasé about approving PRs. Ignore minor versions. | Sachin | To do | Consider ways to reduce system resources instead? Discussed with Doris and ESR team last week… |
4. Restrict the number of PRs Dependabot opens on each ESR project to 1 and to major versions only (given they’re microservices, we still might get significant numbers). See the config sketch after this table. | AndyD | To do | There are dependencies between PRs, so restricting to one might have a chain reaction - no getting round the fact that things like this will need manual intervention. So let’s see how restricting to major versions affects things and decide on further action if needed at that point. |
5. Use Jenkins pipeline rather than node - to make things sequential rather than concurrent. | AndyD / Pepe | To do | Pipelines not written to take advantage of extra nodes - is the time investment required greater than the benefit? |
6. Address https://hee-tis.atlassian.net/browse/TISNEW-5613 quickly (removing local stack from the Data exporter process). Needs refining and sub-tasking with ESR and the perm team, before ESR disappear. | ESR & perm team | To do | Spike ticket completed this Sprint. This ticket is the result of that Spike. |
7. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day. | Pepe | To do | TBC |
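For action item 4, a minimal sketch of what the per-repository Dependabot config could look like, assuming GitHub-native Dependabot (v2 config format). The package ecosystem, schedule and label below are illustrative assumptions, not the current ESR set-up:

```yaml
# .github/dependabot.yml - illustrative sketch only
version: 2
updates:
  - package-ecosystem: "maven"     # assumption: adjust per ESR service
    directory: "/"
    schedule:
      interval: "monthly"
    open-pull-requests-limit: 1    # at most one open Dependabot PR per repo
    labels:
      - "dependabot"               # label a Jenkinsfile guard could check for
    ignore:
      - dependency-name: "*"
        update-types:              # skip minor/patch bumps, keep major ones
          - "version-update:semver-minor"
          - "version-update:semver-patch"
```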
Lessons Learned (Good and Bad)
Check monitoring channels in Slack, check Prometheus, check Grafana as a matter of professional pride, daily.
Act* on anything unusual (* resolve yourself if you can, alert others immediately if you think it’s serious, raise on stand ups otherwise).
The incident has encouraged us to identify the root cause, to identify some inefficiencies in related areas, and to map out a range of short- and longer-term actions to address them all.
Techy Stuff
5 whys
Everything fell over:
Jenkins
- ESR containers taking up all the resource
- Too many (Dependabot) PRs outstanding, builds, rebasing
- ESR did not have time to action them because of the launch of the new world code
Underlying OS upgrade occurred and was applied (LivePatch)
- Probably something to do with that very specific update (it has never happened before, and the process has been running since the apps were built)
- Did Docker struggle with the patch, rather than Amazon making a mistake with applying it? (No complaints on Amazon forums, so it looks quite specific to us)
- The version of Docker is linked to whatever’s available in apt
- Not worth changing our set-up for the sake of a ‘freak’ occurrence that will probably only happen once every 3 years? Or switching off the OS upgrades and handling them manually every week / switching everything off and restarting it regularly?

Comments following catch up 2020-10-21:
Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue to be investigated and confirmed…
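To help confirm that, a few commands could show whether the 06:42 apt run rebooted the hosts or applied the kernel fix live. This is a sketch only, assuming Ubuntu hosts with Canonical Livepatch and standard apt logging; it is not a record of what was actually run during the incident:

```sh
# When did the host last boot? (a reboot around 06:4x would rule out a live patch)
uptime -s
last reboot | head -n 5

# What did the 06:42 apt run actually install/remove?
grep -A 10 "Start-Date: 2020-10-16" /var/log/apt/history.log

# Is a reboot still pending, and did Livepatch handle the kernel fix?
cat /var/run/reboot-required 2>/dev/null
canonical-livepatch status --verbose
```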
...
Initial discussion, along with short and longer term actions
What can we do about Dependabot creating and building simultaneously?
Dependabot does run sequentially, but much faster than Jenkins can process things so everything appears concurrent.
We could get Dependabot to add a GitHub label to the PR, and add something to the Jenkinsfile to read the label and mark the build as “don’t run”. But this stops Dependabot being useful.
ESR preoccupied with launching New World, understandably!
Can perm team keep on top of ESR stuff when they leave?
Even when keeping on top of things, will it eventually be too much anyway?
Or is it simply a case of the team not controlling the overall number of open PRs?
The original Jenkins build was never designed to handle this much load - the underlying architecture isn’t there for the level of automation we now have. It was designed for a single node, not load-balancing.
Is Jenkins the right tool for everything it’s being asked to do? No:
...
bump up the Jenkins RAM to 32Gb (short term ONLY). Add a reminder to revisit this in 1 month / 2 months?
...
Disable integration tests on ESR projects for the PR pipeline (they’d still run on merge to master, rather than on each PR). These are what fire up the local stack and test containers. (Hold back on ‘if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR’.)
...
Close outstanding ESR PRs - how many is ‘critical mass’? But without being blasé about approving PRs
...
Restrict the number of PRs Dependabot opens on each ESR project to 1 (but given they’re microservices, it still might be a big number). Not much of a concern if we do 2. above.
...
The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day.
...
Move ETLs over to ECS tasks (serverless ‘run container’ instructions to AWS - not reliant on our infrastructure) - see the sketch after this list.
This would remove the dependency on Jenkins - so if it went down, the jobs could continue.
Don’t do scheduled jobs / anything with a timer on Jenkins - use a cron-style scheduler for this stuff instead.
Just use Jenkins as a build server (Metabase also runs on Jenkins, but doesn’t use much)
...
ticket up addressing our infrastructure so that the set up ESR have created does run - it’s been done right!
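On the ‘move ETLs over to ECS tasks’ point above, a rough sketch of the shape this could take: an EventBridge schedule that launches the ETL container as a Fargate task, so nothing depends on Jenkins being up. CloudFormation is used purely for illustration; the resource names, 02:00 schedule and subnet are invented placeholders, not existing TIS infrastructure:

```yaml
# Illustrative only - names and values are placeholders
NightlyEtlSchedule:
  Type: AWS::Events::Rule
  Properties:
    Description: Run the NDW ETL container nightly, independent of Jenkins
    ScheduleExpression: cron(0 2 * * ? *)
    State: ENABLED
    Targets:
      - Id: ndw-etl-task
        Arn: !GetAtt EtlCluster.Arn            # ECS cluster to run the task on
        RoleArn: !GetAtt EventsInvokeRole.Arn  # role allowing events to call ecs:RunTask
        EcsParameters:
          TaskDefinitionArn: !Ref NdwEtlTaskDefinition
          LaunchType: FARGATE
          NetworkConfiguration:
            AwsVpcConfiguration:
              Subnets:
                - subnet-00000000              # placeholder
```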
...
OS
The ‘apt’ patch run every morning used the LivePatch function to apply the patch without needing to restart everything. This caused a conflict with Docker. Note, however, that this conflict has not been seen since TIS’s inception, so the working theory is that this was a one-off conflict that is unlikely to reoccur.
The conflict then compromised everything else in a domino effect, compounded by the backed-up Dependabot PRs and builds which, in the case of the ESR area, create multiple concurrent containers that trip Jenkins up (not enough RAM)
Many downstream processes rely on Jenkins being up
We haven’t configured the monthly Dependabot sweep to stagger the hit of PRs / builds
Trigger
The 6am OS patch process auto-applied the patches via LivePatch, causing a conflict with Docker, which then needed to be restarted
The failure of Docker to restart cleanly following the OS patch put strain on ElasticSearch and Jenkins (and everything else)
ElasticSearch issues compromised the search function for users of the app
Dependencies of many timed jobs on Jenkins being available
Not enough configuration of retries
Resolution
Restarted Jenkins
Restarted NDW ETLs on Stage (current and PaaS)
Restarted Docker on ES nodes
Restarted Neo4J container (rather than the service)
Spun up CDC (Paul) - looking at the HUGE backlog of messages (which started coming down rapidly. Phew!)
Removed all the dangling volumes from Mongo (several times) (old images that are taking up space)
Detection
Good news: there were lots of monitoring alerts in various Slack channels and, from the graphs below, probably enough evidence to have alerted us to a problem needing resolution
Users pointed out the issue with Search not functioning correctly on TIS
Bad news: we didn’t act on the initial Friday indicators (they didn’t appear to be as serious as they turned out to be)
...
Modify jobs to only build if, e.g. (see the sketch below):
PR was raised by dependabot (and is a major version change)
There are other PRs for the project building
Single Jenkins node responsibility for control and building everything.
We haven’t moved to GHA yet.
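One possible shape for this kind of conditional guard, combined with the earlier idea of making Dependabot PRs identifiable (via the dependabot/* source branch or a label): skip the heavy integration-test stage for Dependabot-raised PRs. This is a sketch against a typical multibranch declarative pipeline, not the actual ESR Jenkinsfile; the stage names, Maven flags and test script are invented:

```groovy
// Sketch only - illustrates skipping heavy integration tests for Dependabot PRs
pipeline {
  agent any
  stages {
    stage('Build & unit tests') {
      steps {
        sh './mvnw -B verify -DskipITs'   // assumption: Maven wrapper is used
      }
    }
    stage('Integration tests (local stack)') {
      when {
        // Only spin up the heavy containers for human-raised PRs / master
        not { expression { (env.CHANGE_BRANCH ?: '').startsWith('dependabot/') } }
      }
      steps {
        retry(2) {                        // cheap win for flaky container start-up
          sh './run-integration-tests.sh' // placeholder script name
        }
      }
    }
  }
}
```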
Person Sync (rather than Person Owner?) job:
Prod-to-Stage didn’t fail on Saturday
First there was a problem with the ElasticSearch sync job on Sunday
A Docker restart made elastic-search available to ADMINS-UI again
Checklist of actions we took in this instance
- Restarted Jenkins
- Restarted NDW ETLs on Stage (current and PaaS)
- Restarted Docker on ES nodes
- Restarted Neo4J container (rather than the service)
- Spun up CDC (Paul) - looking at the HUGE backlog of messages (which started coming down rapidly. Phew!)
- Removed all the dangling volumes from Mongo (several times) (old images that are taking up space)
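For action item 0, a first-pass sketch of what the Ansible ‘full restart’ play could look like, mirroring the checklist above. The host group names, service names and Neo4j container name are assumptions for Ops to confirm; this is not an existing playbook:

```yaml
# restart-after-outage.yml - sketch only, to be fleshed out by Ops
- name: Restart Jenkins after an outage
  hosts: jenkins                  # assumed inventory group
  become: true
  tasks:
    - name: Restart the Jenkins service
      ansible.builtin.service:
        name: jenkins
        state: restarted

- name: Restart Docker and app containers on ES / app nodes
  hosts: es_nodes:app_nodes       # assumed inventory groups
  become: true
  tasks:
    - name: Restart Docker
      ansible.builtin.service:
        name: docker
        state: restarted

    - name: Restart the Neo4j container (the container, not the service)
      community.docker.docker_container:
        name: neo4j               # assumed container name
        state: started
        restart: true

    - name: Remove dangling Docker volumes taking up space
      ansible.builtin.command: docker volume prune -f
```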
Checklist of actions we should take in future
...