Date |
|
Authors | Andy Nash (Unlicensed) Joseph (Pepe) Kelly John Simmons (Deactivated) Andy Dingley Simon Meredith (Unlicensed) Paul Hoang (Unlicensed) Sachin Mehta (Unlicensed) |
Status | LiveDefect resolved. Actions being ticketed up (Andy Nash (Unlicensed)) |
Summary | On Friday evening we saw Jenkins struggling, and then fell over, subsequently causing ETL and data related issues. Elasticsearch, RabbitMQ and MongoDB then also fell over between Friday and Saturday |
Impact | No Stage. No Prod. No data syncing in various places |
...
First trigger | |||
| Friday |
| |
Starting Friday and continuing over the weekend | |||
| Friday |
| because of sharding, users may have been seeing only partial results at this point) (PH) 👈 need to look into and confirm this assumption… |
Then | |||
| Saturday |
| |
…at this point ☝ pretty much everything is being effected by the combination of issues | |||
👇 shall I remove this section of the timeline completely from this incident log? Seems overkill, given pretty much everything was down at this point | |||
| Saturday |
|
|
...
Everything fell over | Comments following catch up 2020-10-21 | |
---|---|---|
|
| Jenkins could to with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue that we believe triggered everything. |
Initial discussion, along with short and longer term actionsWhat can we do about Dependabot creating and building simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things so everything appears concurrent.
ESR preoccupied with launching New World, understandably! Can perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway? Original Jenkins build was never designed to handle this much load - underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing Is Jenkins the right tool for everything it’s being asked to do? No:
|
...