Date | |
Authors | Andy Nash (Unlicensed), Joseph (Pepe) Kelly, John Simmons (Deactivated), Andy Dingley, Simon Meredith (Unlicensed), Paul Hoang (Unlicensed) |
Status | In progress |
Summary | On Friday evening we saw Jenkins struggling, and then it fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB then also fell over between Friday and Saturday |
Impact | No Stage. No Prod. No data syncing in various places |
...
Timeline (collation and interpretation from across all Slack monitoring channels)
First trigger
2020-10-16 ~06:00 Patch comes out for the kernel
2020-10-16 06:33 ESR Data Exporter triggered a build of outstanding PRs (resulting from Dependabot upgrades at the Git repo level)
2020-10-16 06:36 one ES cluster node failed
2020-10-16 06:42 amazon-ssm-agent kicked off an “apt” upgrade (at OS level): xxd, python 2.7, vim, python 2.6, linux AWS 5.4 headers, etc.
2020-10-16 06:?? Old versions of xxd, python 2.7, vim, python 2.6, linux AWS 5.4 headers, etc. were removed - so either something restarted the server automatically, or the upgrades were applied without a restart (e.g. via Livepatch).
Starting Friday and continuing over the weekend
2020-10-16 07:18 Staging RabbitMQ node 2 down (PH)
2020-10-16 07:38 Prod ES node 3 down (PH)
2020-10-16 07:38 Prod ES node 1 & 2 down - additional alert of too few nodes running - at this point, prod person search should not be working (because of sharding, users may have been seeing only partial results at this point) (PH)
☝ Need to look into and confirm this assumption…
2020-10-16 07:58 Staging ES node 2 down (PH)
2020-10-16 08:43 Phil W asks what this all means, Phil J summarises (PH)
2020-10-16 12:03 Old Concerns on green & blue Stage goes down (PH)
2020-10-16 16:38 Jenkins goes down (PH)
Then
2020-10-17 01:13 High message count in RabbitMQ Prod (PH)
2020-10-17 01:28 High message count in RabbitMQ Staging (PH)
2020-10-17 06:00 ESR n-d-l cron job didn’t start - manually kicked by Paul at 10:08, exited at 10:09
2020-10-17 06:00 ESR ETL
2020-10-17 06:02 ES Docker container failure (see Sachin’s snippet on Slack channel)
2020-10-17 07:08 Staging ES node 2 down, Prod RabbitMQ node 3 down (PH)
2020-10-17 07:18 Staging ES node 1 & 3 down (PH)
2020-10-17 07:33 Staging RabbitMQ node 1 & 3 down (PH)
2020-10-17 07:43 Prod Mongo goes down (PH)
...
Confirmed action items | Owner | Status | Comments |
---|---|---|---|
0. Spike the creation of an Ansible config to do a full restart (John Simmons (Deactivated) to complete the description of this) - in case we come across this situation again. Can we put together a checklist of the actions that were taken to try to get us up and running again? Ideally this checklist could be a reference for future failures | Ops | To do | |
1. Bump up the Jenkins RAM to 32GB (short term ONLY). Add a reminder to revisit this in 1 month / 2 months? | Ops | Done | |
| ESR & others | To do | |
3. Close outstanding ESR major version change PRs - how many is ‘critical mass’? But without being blasé about approving PRs. Ignore minor versions. | Sachin | To do | Consider ways to reduce system resources instead? |
4. Restrict the number of PRs Dependabot opens on each ESR project to 1 and to major versions only (given they’re microservices, we still might get significant numbers). | AndyD | To do | |
5. Use Jenkins pipeline rather than node - to make things sequential rather than concurrent | AndyD / Pepe | To do | Pipelines not written to take advantage of extra nodes - is the time investment required greater than the benefit? |
6. Address https://hee-tis.atlassian.net/browse/TISNEW-5613 quickly (removing local stack from the Data exporter process). | ESR & perm team | To do | Spike ticket completed this Sprint. This ticket is the result of that Spike. |
7. The Elasticsearch nightly sync shouldn’t be necessary. Verify that Elasticsearch is being updated properly during the day. | Pepe | To do | TBC - see the sketch after this table |
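For item 7, a minimal sketch of how the daytime sync could be verified, assuming the person index is reachable over HTTP and documents carry a last-modified timestamp; the host, index and field names below are illustrative assumptions rather than the actual TIS configuration.

```python
"""Rough freshness check: is the Elasticsearch index still being updated
during the day, or are we relying on the nightly sync?

Assumptions (not taken from the incident notes): the cluster is reachable on
localhost:9200, the index is called "persons" and documents carry a
"lastModifiedDate" ISO-8601 timestamp in UTC. Adjust to the real TIS setup.
"""
import datetime

import requests

ES_URL = "http://localhost:9200"        # assumed cluster address
INDEX = "persons"                       # assumed index name
FRESHNESS_FIELD = "lastModifiedDate"    # assumed timestamp field
MAX_LAG = datetime.timedelta(hours=2)   # arbitrary "fresh enough" threshold


def most_recent_update() -> datetime.datetime:
    """Return the newest timestamp held in the index (naive UTC)."""
    query = {
        "size": 1,
        "sort": [{FRESHNESS_FIELD: {"order": "desc"}}],
        "_source": [FRESHNESS_FIELD],
    }
    resp = requests.get(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
    resp.raise_for_status()
    raw = resp.json()["hits"]["hits"][0]["_source"][FRESHNESS_FIELD]
    stamp = datetime.datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return stamp.replace(tzinfo=None)   # compare naively against utcnow()


if __name__ == "__main__":
    lag = datetime.datetime.utcnow() - most_recent_update()
    if lag > MAX_LAG:
        print(f"WARNING: '{INDEX}' last updated {lag} ago - daytime sync may not be running")
    else:
        print(f"OK: '{INDEX}' updated {lag} ago")
```

If the lag stays under the threshold through a normal working day, that is evidence the nightly sync really is redundant.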
...
5 whys
Everything fell over | Comments following catch-up 2020-10-21 | |
---|---|---|
| Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue to be investigated and confirmed… | |
Initial discussion, along with short and longer term actions:
What can we do about Dependabot creating and building simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things, so everything appears concurrent.
ESR is preoccupied with launching New World, understandably! Can the perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway?
The original Jenkins build was never designed to handle this much load - the underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing.
Is Jenkins the right tool for everything it’s being asked to do? No:
...
Prod-to-Stage didn’t fail on Saturday
First, a problem with the Elasticsearch sync job on Sunday
A Docker restart made Elasticsearch available to ADMINS-UI again
Graphs
...
Checklist of actions we took in this instance
- Restarted Jenkins
- Restarted NDW ETLs on Stage (current and PaaS)
- Restarted Docker on ES nodes
- Restarted Neo4J container (rather than the service)
- Spun up CDC (Paul) - looking at the HUGE backlog of messages, which started coming down rapidly. Phew! (See the queue-depth sketch after this checklist.)
- Removed all the dangling volumes from Mongo (several times) (old images that were taking up space)
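The backlog in the CDC step above was watched by eye; below is a small sketch of how queue depths could be polled from a script instead, via the RabbitMQ management API (the management plugin must be enabled). The URL, credentials and threshold are placeholders, not the real Prod values.

```python
"""Watch RabbitMQ queue depths via the management API.

GET /api/queues returns one JSON object per queue, including the current
"messages" count. Host, credentials and threshold below are placeholders.
"""
import requests

MGMT_URL = "http://localhost:15672"   # assumed management API address
AUTH = ("guest", "guest")             # placeholder credentials
BACKLOG_THRESHOLD = 10_000            # arbitrary "this looks bad" level


def queues_with_backlog():
    """Yield (queue name, message count) for queues above the threshold."""
    resp = requests.get(f"{MGMT_URL}/api/queues", auth=AUTH, timeout=10)
    resp.raise_for_status()
    for queue in resp.json():
        messages = queue.get("messages", 0)
        if messages >= BACKLOG_THRESHOLD:
            yield queue["name"], messages


if __name__ == "__main__":
    backlog = list(queues_with_backlog())
    if backlog:
        for name, messages in backlog:
            print(f"{name}: {messages} messages queued")
    else:
        print("No queues above the backlog threshold")
```

Run on a schedule (or wire into the existing Slack monitoring) so a climbing backlog is spotted before it reaches the weekend sizes seen here.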
Checklist of actions we should take in future
One-off task: Look at the way updates are applied via AWS Systems Manager - tell it when you want to do the patching. Schedule Stage to be patched first, check it, and then apply to Prod. Then set up a simple Slack reminder to check all is well (see the boto3 sketch after this checklist).
- Manage user expectations - alert them AS SOON AS we know something’s amiss. AND keep updates flowing as we resolve things. UNDERSTAND which of our remedial actions will have what effect on users. UPDATE the status alert on TIS (ensure it’s not just Phil KOTN who has access / does this).
- When an ES node fails, generate an auto-restart (see the health-check sketch after this checklist)
- Check logs
- Platform-wide restart - ensure everything is brought back up ‘clean’
- Recheck
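The boto3 sketch referenced in the one-off task above: registering weekly AWS Systems Manager maintenance windows so Stage is patched a day before Prod, rather than letting updates apply ad hoc. The window names, schedules and region are assumptions to adapt, not a finished implementation.

```python
"""Sketch: create SSM maintenance windows so Stage is patched before Prod.

Names, cron schedules and region are illustrative assumptions. Targets and
tasks (e.g. the AWS-RunPatchBaseline document) still need to be registered
against each window; that step is omitted here for brevity.
"""
import boto3

ssm = boto3.client("ssm", region_name="eu-west-2")  # assumed region

WINDOWS = [
    # (name, cron schedule in UTC) - Stage first, Prod the following night
    ("tis-stage-patching", "cron(0 2 ? * TUE *)"),
    ("tis-prod-patching", "cron(0 2 ? * WED *)"),
]

for name, schedule in WINDOWS:
    window = ssm.create_maintenance_window(
        Name=name,
        Schedule=schedule,
        Duration=3,                    # hours the window stays open
        Cutoff=1,                      # stop starting new tasks 1h before close
        AllowUnassociatedTargets=False,
    )
    print(f"Created {name}: {window['WindowId']}")
```

A Slack reminder scheduled for the morning after each window would cover the "check all is well" half of the task.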
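The health-check sketch referenced in the ES auto-restart item above: the shape of a watchdog that could run on each ES node, polling cluster health and restarting the local Docker container if the node has dropped out or stopped responding. The container name, endpoint and expected node count are assumptions, not the actual TIS values.

```python
"""Watchdog sketch for an Elasticsearch node: if the local node stops
responding, or the cluster reports missing nodes / red status, restart the
local Docker container. Container name, endpoint and node count are assumed.
"""
import subprocess

import requests

ES_HEALTH_URL = "http://localhost:9200/_cluster/health"  # local node endpoint
CONTAINER_NAME = "elasticsearch"                          # assumed container name
EXPECTED_NODES = 3                                        # assumed cluster size


def node_is_healthy() -> bool:
    """True if the local node responds and the cluster has all its nodes."""
    try:
        resp = requests.get(ES_HEALTH_URL, timeout=5)
        resp.raise_for_status()
        health = resp.json()
        return health["status"] != "red" and health["number_of_nodes"] >= EXPECTED_NODES
    except requests.RequestException:
        return False


def restart_container() -> None:
    """Restart the local Elasticsearch container (needs docker permissions)."""
    subprocess.run(["docker", "restart", CONTAINER_NAME], check=True)


if __name__ == "__main__":
    if not node_is_healthy():
        print("Elasticsearch node unhealthy - restarting container")
        restart_container()
```

Run from cron or a systemd timer; a real version would want some back-off so it does not keep restarting a container that is still recovering.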
Graphs
...