Date |
|
Authors | Andy Nash, Joseph (Pepe) Kelly, John Simmons, Andy Dingley, Simon Meredith, Paul Hoang, Sachin Mehta |
Status | Live. Defect resolved. Actions being ticketed up (Andy Nash) |
Summary | On Friday evening we saw Jenkins struggling; it then fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB then also fell over between Friday and Saturday |
Impact | No Stage. No Prod. No data syncing in various places |
...
First trigger | |||
| Friday |
| |
Starting Friday and continuing over the weekend | |||
| Friday |
| (because of sharding, users may have been seeing only partial results at this point) (PH) 👈 need to look into and confirm this assumption… |
Then | |||
| Saturday |
| |
…at this point ☝ pretty much everything is being affected by the combination of issues | |||
👇 shall I remove this section of the timeline completely from this incident log? Seems overkill, given pretty much everything was down at this point | |||
| Saturday |
|
|
...
The OS apt patch that runs every morning used the LivePatch function to apply the patch without needing to restart everything. This caused a conflict with Docker. Note, however, that this conflict has not been seen before since TIS’s inception, so the working theory is that this was a one-off conflict that is unlikely to recur.
The index of trainees for the search page (Elasticsearch) was unreachable and couldn’t start up.
The conflict then compromised everything else in a domino effect, compounded by the backed-up Dependabot PRs and builds which, in the case of the ESR area, create multiple concurrent containers, which trips Jenkins up (not enough RAM).
Many downstream processes rely on Jenkins being up
We haven’t configured the monthly Dependabot sweep to stagger the hit of PRs / builds
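One way to picture staggering is to give each repository its own start time instead of letting them all fire together. The sketch below is illustrative only: the repository names, the first start time and the gap are hypothetical, and whether Dependabot's schedule settings can express these slots for our setup would need checking before any ticket is raised.

```python
from datetime import datetime, timedelta


def stagger_start_times(repos, first_start="02:00", gap_minutes=45):
    """Give each repo its own HH:MM start time, spaced gap_minutes apart.

    Repos are sorted so the assignment is stable between runs.
    Note: times wrap past midnight without warning if there are many repos.
    """
    start = datetime.strptime(first_start, "%H:%M")
    return {
        repo: (start + timedelta(minutes=gap_minutes * i)).strftime("%H:%M")
        for i, repo in enumerate(sorted(repos))
    }


if __name__ == "__main__":
    # Hypothetical repository names, for illustration only.
    for repo, at in stagger_start_times(["repo-c", "repo-a", "repo-b"]).items():
        print(f"{repo}: {at}")
```

The gap would ideally be longer than a typical Jenkins build for the repository, so one batch of PR builds has finished before the next begins.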
...
Restarted Jenkins
Restarted NDW ETLs on Stage (current and PaaS)
Restarted Docker on ES nodes
Restarted Neo4J container (rather than the service)
Spun up CDC (Paul) - looking at the HUGE backlog of messages from changes to TIS data (which started coming down rapidly. Phew!)
Removed all the dangling volumes from Mongo (several times) (old images that are taking up space)
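The log does not record the exact commands used for the volume clean-up. As a rough sketch, assuming the standard Docker CLI is available on the host, the repeated manual step could be wrapped like this. The function names and the `dry_run` default are illustrative choices, not something the team ran.

```python
import subprocess


def dangling_volume_names(run=subprocess.run):
    """Return the names of Docker volumes not referenced by any container."""
    result = run(
        ["docker", "volume", "ls", "--quiet", "--filter", "dangling=true"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def remove_dangling_volumes(run=subprocess.run, dry_run=True):
    """List dangling volumes and, unless dry_run is set, remove them.

    Returns the volume names that were (or would have been) removed.
    """
    names = dangling_volume_names(run)
    if names and not dry_run:
        run(["docker", "volume", "rm", *names], check=True)
    return names


if __name__ == "__main__":
    # Defaults to a dry run so nothing is deleted by accident.
    print(remove_dangling_volumes(dry_run=True))
```

Defaulting to a dry run means the list can be reviewed before anything is deleted, which matters on a database host.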
...
Everything fell over | Comments following catch up 2020-10-21 | |
---|---|---|
|
| Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue that we believe triggered everything. |
Initial discussion, along with short and longer term actions
What can we do about Dependabot creating and building simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things, so everything appears concurrent.
ESR preoccupied with launching New World, understandably! Can perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway?
Original Jenkins build was never designed to handle this much load - underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing.
Is Jenkins the right tool for everything it’s being asked to do? No:
|
...