...
Restarted build server
Restarted National Data Warehouse ETLs on the Stage and Prod environments
Restarted Docker on search server
Restarted Neo4J GraphDB
Turned on Change Data Capture system to deal with backlog of data changes from TIS to ESR
Removed all the old images from MongoDB, one of our databases
...
Look into being able to do a full restart automatically if a similar situation arises
Increase the memory (RAM) on the build server
Implement and close any remaining patches to ESR which were automatically created
Restrict the number of dependencies that can be upgraded at one time, to reduce the load on the build server (a configuration sketch follows after this list)
Make changes to the build server to move from concurrent to sequential builds (see the Jenkins configuration sketch after this list)
Look at removing the nightly job that refreshes search data as this may no longer be necessary
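As a rough illustration of the "restrict the number of dependencies upgraded at one time" item above: Dependabot's v2 configuration supports a per-project open-pull-requests-limit and a schedule that can stagger the sweep. The fragment below is a sketch only - the package ecosystem, schedule and label are assumptions, not the team's actual configuration.

```yaml
# Hypothetical dependabot.yml sketch (one per repo) - values are illustrative.
version: 2
updates:
  - package-ecosystem: "maven"      # assumption: the ESR services build with Maven
    directory: "/"
    schedule:
      interval: "weekly"            # stagger upgrades rather than one big monthly sweep
      day: "sunday"                 # a quiet day, so Jenkins isn't competing with normal builds
    open-pull-requests-limit: 1     # cap concurrent Dependabot PRs per project
    labels:
      - "dependencies"              # a label the Jenkinsfile could read to skip or queue builds
```

For the "concurrent to sequential builds" item, one option - assuming the Jenkins Configuration as Code plugin is, or could be, in use - is to drop the controller to a single executor so queued builds run one at a time on the single-node setup. Again, a sketch rather than the actual configuration:

```yaml
# Hypothetical Jenkins Configuration-as-Code (JCasC) fragment - illustrative only.
jenkins:
  numExecutors: 1     # one executor on the single-node controller forces builds to queue and run sequentially
  mode: NORMAL
```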
...
On Friday evening we saw Jenkins struggling; it then fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB also fell over between Friday and Saturday.
...
Timeline
...
Root Cause(s)
...
Trigger
...
...
Detection
...
Action Items
Timeline (collation and interpretation from across all Slack monitoring channels)
First trigger - Friday 2020-10-16

| Time | Event |
| --- | --- |
| c. 6am | Patch comes out for the kernel |
| 06:33 | ESR Data exporter triggered a build of outstanding PRs (resulting from Dependabot - upgrade at Git repo level) |
| 06:36 | One ES cluster node failed |
| 06:42 | amazon-ssm-agent kicked off an apt upgrade (at OS level), which included xxd, python 2.7, vim, python 2.6 and linux AWS 5.4 headers |
| 06:?? | Old versions of xxd, python 2.7, vim, python 2.6 and linux AWS 5.4 headers were removed - the upgrades were auto-applied without a restart via LivePatch |
Starting Friday and continuing over the weekend - Friday 2020-10-16

| Time | Event |
| --- | --- |
| 07:18 | Staging RabbitMQ node 2 down (PH) |
| 07:38 | Prod ES node 3 down (PH) |
| 07:38 | Prod ES node 1 & 2 down - additional alert of too few nodes running - at this point, prod person search should not be working (because of sharding, users may have been seeing only partial results at this point) (PH) 👈 need to look into and confirm this assumption… |
| 07:58 | Staging ES node 2 down (PH) |
| 08:43 | Phil W asks what this all means, Phil J summarises (PH) |
| 12:03 | Old concerns on green & blue stage goes down (PH) |
| 16:38 | Jenkins goes down (PH) |
Then - Saturday 2020-10-17

| Time | Event |
| --- | --- |
| 6am | Patch comes out for the kernel |
| 01:13 | High messages in RabbitMQ Prod (PH) |
| 01:28 | High messages in RabbitMQ Staging (PH) |
| 06:00 | ESR n-d-l cron job didn’t start - manually kicked by Paul at 10:08, exited at 10:09 |
| 06:00 | ESR ETL ES Docker container failure (see Sachin’s snippet on Slack channel) |
| 06:02 | Staging ES node 2 down (PH) |
| 07:08 | Prod RabbitMQ node 3 down (PH) |
| 07:18 | Staging ES node 1 & 3 down (PH) |
| 07:33 | Staging RabbitMQ node 1 & 3 down (PH) |
| 07:43 | Prod Mongo goes down (PH) |
…at this point ☝ pretty much everything is being affected by the combination of issues

👇 shall I remove this section of the timeline completely from this incident log? It seems overkill, given pretty much everything was down at this point
| Date / time | Day | Event |
| --- | --- | --- |
| 2020-10-17 08:48 | Saturday | (see Prometheus graph below) |
| 2020-10-17 09:27 | Saturday | (see Prometheus graph below) |
| 2020-10-17 10:23 | Saturday | D/B Prod/Stage sync started but never completed |
| 2020-10-17 10:23 | Saturday | NDW ETL: Stage (PaaS) failed |
| 2020-10-18 02:42 | Sunday | ESR Sentry errors x 7 (reappearance of the same issue across all services) |
| 2020-10-18 07:37 | Sunday | TCS ES sync job failed to run/complete on either blue or green servers |
| 2020-10-18 10:23 | Sunday | NDW ETL: Stage (PaaS) failed |
| 2020-10-18 10:25 | Sunday | NDW ETL: Stage (current) failed |
| 2020-10-18 07:37 | Sunday | TCS ES sync job failed to run/complete on either blue or green servers |
| 2020-10-19 01:29 | Monday | TCS ES Person sync job failed (None of the configured nodes were available) |
| 2020-10-19 07:46 | Monday | Users started reporting problems using Search on Prod |
| 2020-10-19 08:59 | Monday | Users reporting problems using Search on Prod had been resolved |
| 2020-10-19 07:54 | Monday | (see Prometheus graph below) |
| 2020-10-19 08:17 | Monday | (see Prometheus graph below) |
| 2020-10-19 10:35 | Monday | (see Prometheus graph below) |
| 2020-10-17 | Tuesday | Massive Sentry hit, on ESR, using up our entire monthly allocation |
Root Cause(s)
The OS apt patch that runs every morning used the LivePatch function to apply the patch without needing to restart everything.
This caused a conflict with Docker. Note, however, that this conflict has not been seen before since TIS’s inception, so the working theory is that this was a one-off conflict that is not likely to recur.
The index of trainees for the search page (Elasticsearch) was unreachable and couldn’t start up.
The conflict then compromised everything else in a domino effect, compounded by the backed-up Dependabot PRs and builds which, in the case of the ESR area, create multiple concurrent containers that trip Jenkins up (not enough RAM)
Many downstream processes rely on Jenkins being up
We haven’t configured the monthly Dependabot sweep to stagger the hit of PRs / builds
...
Everything fell over:
Jenkins - ESR containers taking up all the resource. Too many (Dependabot) PRs outstanding, builds and rebasing; ESR did not have time to action them because of the launch of the new world code.

Comments following catch up 2020-10-21:
Underlying OS upgrade occurred and was applied (LivePatch). It is probably something to do with that very specific update (it has never happened before, and the upgrade process has been running since the apps were built). Did Docker struggle with the patch, rather than Amazon making a mistake with applying it? (No complaints on Amazon forums, so it looks quite specific to us.) The version of Docker is linked to whatever is available in apt. It is probably not worth changing our set-up for the sake of a ‘freak’ occurrence that will only happen once every 3 years or so. Alternatively, we could schedule the OS upgrades for a better time / day, so we can actively check that they have done what they were expected to do and not caused conflicts. We can then determine whether a full restart immediately afterwards would be sensible too.

Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue that we believe triggered everything. It destabilised the system, stopping Docker. When Docker was manually restarted on Monday morning, everything started coming back up again, and stability was restored. (A sketch of automatic container restarts follows below.)
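On the question of recovering automatically when Docker is destabilised (and whether a full restart after OS patches would be sensible), one low-effort option - assuming the affected services run under docker-compose, which is an assumption here - is to rely on restart policies and healthchecks so containers come back on their own once the daemon is healthy. Service and image names below are illustrative, not the actual TIS stack definitions.

```yaml
# Hypothetical docker-compose fragment - services, images and thresholds are illustrative.
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.8.12
    restart: unless-stopped    # restart the container with the daemon/host, unless it was stopped deliberately
    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
  rabbitmq:
    image: rabbitmq:3-management
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: 30s
      timeout: 10s
      retries: 5
```

This would not have fixed the stopped Docker daemon itself in this incident, but it removes the need to manually restart each container once the daemon is back.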
Initial discussion, along with short and longer term actions

What can we do about Dependabot creating and building simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things, so everything appears concurrent. We could get Dependabot to add a GitHub label to the PR and add something to the Jenkinsfile to read the label and mark the build as “Don’t run” - but this stops Dependabot being useful.
ESR were preoccupied with launching New World, understandably! Can the perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway? Or is it simply a case of the team not controlling the overall number of open PRs? The original Jenkins build was never designed to handle this much load - the underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing. Is Jenkins the right tool for everything it’s being asked to do? No:
1. Bump up the Jenkins RAM to 32Gb (short term ONLY). Add a reminder to revisit this in 1 month / 2 months?
2. Disable integration tests on ESR projects for the PR pipeline (they’d still run on merge to master, rather than on each PR). These are what fire up the local stack and test containers. (Hold back on: ‘if we’re not planning to do any further ESR work once Leeerrroooy leave, we could just disable the integration tests in ESR’.)
3. Close outstanding ESR PRs - how many is ‘critical mass’? But without being blasé about approving PRs.
4. Restrict the number of PRs Dependabot opens on each ESR project to 1 (but given they’re microservices, it still might be a big number). Not much of a concern if we do 2. above.
5. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day.
6. Move ETLs over to ECS tasks (serverless ‘run container’ instructions to AWS - not reliant on our infrastructure). This would remove the dependency on Jenkins - so if it went down, the jobs could continue. N.B. This doesn’t apply to the ElasticSearch job. (A sketch follows after this list.)
7. Don’t do scheduled jobs / anything with a timer in Jenkins - use the Cron server instead for this stuff. Just use Jenkins as a build server (Metabase also runs on Jenkins, but doesn’t use much).
8. Ticket up addressing our infrastructure so that the set-up ESR have created does run - it’s been done right!
9. Get ourselves a dedicated Jenkins server (what size?).
10. Move to ElasticSearch SaaS.
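To make the "move ETLs over to ECS tasks" idea above concrete, the sketch below shows the general shape of a scheduled, serverless container run in CloudFormation, so the job no longer depends on Jenkins being up. Every name, size, role and schedule here is a placeholder for illustration, not the team's actual infrastructure.

```yaml
# Hypothetical CloudFormation sketch: run an ETL container as a scheduled Fargate task.
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ClusterArn:
    Type: String          # ARN of an existing ECS cluster
  EtlImage:
    Type: String          # ETL container image (placeholder)
  ExecutionRoleArn:
    Type: String          # task execution role (pull image, write logs)
  EventsRoleArn:
    Type: String          # role allowing EventBridge to start the task
  SubnetId:
    Type: String
Resources:
  EtlTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: ndw-etl               # placeholder family name
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "512"
      Memory: "1024"
      ExecutionRoleArn: !Ref ExecutionRoleArn
      ContainerDefinitions:
        - Name: ndw-etl
          Image: !Ref EtlImage
          Essential: true
  EtlSchedule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: "cron(0 1 * * ? *)"   # nightly at 01:00 UTC (placeholder)
      State: ENABLED
      Targets:
        - Id: ndw-etl
          Arn: !Ref ClusterArn
          RoleArn: !Ref EventsRoleArn
          EcsParameters:
            TaskDefinitionArn: !Ref EtlTaskDefinition
            TaskCount: 1
            LaunchType: FARGATE
            NetworkConfiguration:
              AwsVpcConfiguration:
                Subnets:
                  - !Ref SubnetId
```

Because the schedule lives in EventBridge and the container runs on Fargate, a Jenkins outage like this one would not stop the nightly jobs from running.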
...