Date |
|
Authors | Andy Nash, Joseph (Pepe) Kelly, John Simmons, Andy Dingley, Simon Meredith, Paul Hoang, Sachin Mehta |
Status | Live. Defect resolved. Actions being ticketed up (Andy Nash) |
Summary | On Friday evening we noticed that the server used to build our applications and run scheduled jobs was struggling. It then crashed, subsequently causing ETL (scheduled data flows between systems) and data-related issues. The systems used to search, transfer and display data in TIS, along with the data stores, then also froze between Friday and Saturday. |
Impact | Users unable to search for trainees. Staging environment was not updated with production data on Saturday as scheduled. No data synchronisation between systems. |
...
Restarted build server
Restarted National Data Warehouse ETLs on Stage and Prod environments
Restarted Docker on search server
Restarted Neo4J GraphDB
Turned on Change Data Capture system to deal with backlog of data changes from TIS to ESR
Removed all the old images from MongoDB, one of our databases
...
Look into being able to do a full restart automatically if a similar situation arises
Increase the memory (RAM) on the build server
Implement and close any remaining patches to ESR which were automatically created
Restrict the number of dependencies that can be upgraded at one time to reduce the load on the build server
Make changes to build server to move from concurrent to sequential builds
Look at removing the nightly job that refreshes search data as this may no longer be necessary
Lessons Learned (Good and Bad)
Dev team to check the monitoring channels more regularly
Act on anything unusual (resolve immediately if easy to do so, alert others immediately if it appears serious, raise on stand ups otherwise).
Incident has encouraged us to do the most thorough root cause analysis we have ever done, and subsequently to identify some inefficiencies in related areas and map out a range of short and longer term actions to address them all.
...
Technical Detail
Summary
On Friday evening we saw Jenkins struggling. It then fell over, subsequently causing ETL and data-related issues. Elasticsearch, RabbitMQ and MongoDB then also fell over between Friday and Saturday.
Timeline
...
...
...
...
...
Timeline (collation and interpretation from across all Slack monitoring channels)
First trigger | |||
| Friday |
| |
Starting Friday and continuing over the weekend | |||
| Friday |
| (Because of sharding, users may have been seeing only partial results at this point) (PH) 👈 need to look into and confirm this assumption… |
Then | |||
| Saturday |
| |
…at this point ☝ pretty much everything is being affected by the combination of issues | |||
👇 shall I remove this section of the timeline completely from this incident log? Seems overkill, given pretty much everything was down at this point | |||
| Saturday |
|
Root Cause(s)
An OS apt patch running every morning used the LivePatch function to apply the patch without needing to restart everything. This caused a conflict with Docker. Note, however, that this conflict has not been seen before since TIS’s inception, so the working theory is that this was a one-off conflict that is not likely to recur.
The index of trainees for the search page (Elasticsearch) was unreachable and couldn’t start up.
The conflict then compromised everything else in a domino effect, compounded by the backed-up Dependabot PRs and builds which, in the case of the ESR area, create multiple concurrent containers that trip Jenkins up (not enough RAM).
Many downstream processes rely on Jenkins being up
We haven’t configured the monthly Dependabot sweep to stagger the hit of PRs / builds
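The effect of builds arriving all at once versus being staggered can be sketched with a small simulation. This is only an illustration under stated assumptions: the `Build` class, the `simulate` function, and the per-build figures (4 GB, 10 minutes, 8 builds) are hypothetical and were not measured on our Jenkins server.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Build:
    name: str
    ram_gb: float  # hypothetical peak RAM used by the build's containers
    minutes: int   # hypothetical build duration


def simulate(builds, max_concurrent):
    """Run builds in FIFO order with at most `max_concurrent` at a time.

    Returns (peak RAM in GB, total elapsed minutes).
    """
    running = []  # min-heap of (finish_time, ram_gb)
    now = 0
    in_use = 0.0
    peak = 0.0
    for build in builds:
        # If every slot is busy, wait for the earliest build to finish.
        if len(running) >= max_concurrent:
            finish, ram = heapq.heappop(running)
            now = max(now, finish)
            in_use -= ram
        # Release any other builds that have also finished by now.
        while running and running[0][0] <= now:
            _, ram = heapq.heappop(running)
            in_use -= ram
        heapq.heappush(running, (now + build.minutes, build.ram_gb))
        in_use += build.ram_gb
        peak = max(peak, in_use)
    total = max((finish for finish, _ in running), default=now)
    return peak, total


# Assumed workload: 8 dependency-update builds landing at the same time.
sweep = [Build(f"pr-{i}", ram_gb=4.0, minutes=10) for i in range(8)]
concurrent = simulate(sweep, max_concurrent=8)
sequential = simulate(sweep, max_concurrent=1)
print("concurrent:", concurrent)
print("sequential:", sequential)
```

With these made-up numbers the concurrent case peaks at 32 GB and finishes in 10 minutes, while the sequential case peaks at 4 GB but takes 80 minutes. That is the trade-off behind moving to sequential builds or staggering the Dependabot sweep.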
...
Confirmed action Items | Owner | Status | Comments |
---|---|---|---|
0. Spike the creation of an Ansible config to do a full restart (John Simmons to complete the description of this), in case we come across this situation happening again. | Ops | To do | |
1. Bump up the Jenkins RAM to 32 GB (short term ONLY). Add a reminder to revisit this in 1 month / 2 months? | Ops | Done | |
| ESR & others | To do | |
3. Close outstanding ESR major version change PRs - how many is ‘critical mass’? But without being blasé about approving PRs. Ignore minor versions. | Sachin | To do | Consider ways to reduce system resources instead? |
4. Restrict the number of PRs Dependabot opens on each ESR project to 1 and to major versions only (given they’re microservices, we still might get significant numbers). | AndyD | To do | |
5. Use Jenkins pipeline rather than node, to make things sequential rather than concurrent | AndyD / Pepe | To do | Pipelines are not written to take advantage of extra nodes. Is the time investment required greater than the benefit? |
6. Address https://hee-tis.atlassian.net/browse/TISNEW-5613 quickly (removing local stack from the Data exporter process). | ESR & perm team | To do | Spike ticket completed this Sprint. This ticket is the result of that Spike. |
7. The ElasticSearch nightly sync shouldn’t be necessary. Verify that ElasticSearch is being updated properly during the day. | Pepe | To do | TBC |
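The selection rule proposed in action 4 (major versions only, at most one open PR per project) can be sketched as a simple filter. This is not Dependabot's own configuration or API, which is set up separately per repository. The function names, project names and version numbers below are hypothetical and only illustrate the policy.

```python
from collections import defaultdict


def major(version: str) -> int:
    """Return the major component of a dotted version string such as '2.3.1'."""
    return int(version.split(".")[0])


def select_updates(proposed, limit_per_project=1):
    """Keep major-version bumps only, capped at `limit_per_project` per project.

    `proposed` is a list of (project, dependency, current_version, new_version).
    """
    kept = []
    opened = defaultdict(int)
    for project, dependency, current, new in proposed:
        if major(new) <= major(current):
            continue  # minor or patch bump: ignored under the proposed policy
        if opened[project] >= limit_per_project:
            continue  # this project already has its one allowed PR
        opened[project] += 1
        kept.append((project, dependency, new))
    return kept


# Hypothetical sweep across two made-up ESR microservices.
proposed = [
    ("esr-service-a", "lib-x", "1.4.0", "1.5.0"),  # minor bump
    ("esr-service-a", "lib-y", "2.0.1", "3.0.0"),  # major bump
    ("esr-service-a", "lib-z", "4.2.0", "5.0.0"),  # major bump, over the cap
    ("esr-service-b", "lib-x", "1.4.0", "2.0.0"),  # major bump
]
print(select_updates(proposed))
```

In this made-up sweep, four proposed updates reduce to two PRs, one per project, which is the kind of reduction in simultaneous builds the action is aiming for.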
Lessons Learned (Good and Bad)
Check monitoring channels in Slack, check Prometheus, check Grafana as a matter of professional pride, daily.
Act* on anything unusual (* resolve yourself if you can, alert others immediately if you think it’s serious, raise on stand ups otherwise).
Incident has encouraged us to do the most thorough root cause analysis we have ever done, and subsequently to identify some inefficiencies in related areas and map out a range of short and longer term actions to address them all.
...
Everything fell over | Comments following catch up 2020-10-21 | |
---|---|---|
|
| Jenkins could do with some TLC. ES went down too (before Jenkins). However, there is an underlying OS update issue that we believe triggered everything. |
Initial discussion, along with short and longer term actions: What can we do about Dependabot creating and building simultaneously? Dependabot does run sequentially, but much faster than Jenkins can process things, so everything appears concurrent.
ESR preoccupied with launching New World, understandably! Can perm team keep on top of ESR stuff when they leave? Even when keeping on top of things, will it eventually be too much anyway? Original Jenkins build was never designed to handle this much load - underlying architecture isn’t there for the level of automation we now have. It is designed for a single node, not load-balancing. Is Jenkins the right tool for everything it’s being asked to do? No:
|
...