Date | |
Authors | Simon Meredith |
Status | Data uploaded to ESR on 2019-05-11 |
Summary | The ESR notifications-daily-load job failed on 7th May because of the volume of data relating to the major rotation date of 7th August. TCS hit an out-of-memory (OOM) error, which killed the process. The process was restarted but ran for so long that other scheduled jobs killed it again. |
Impact | Notifications (changes to placements) were not produced and were not available to ESR until 11th May |
Jira reference
- TISNEW-2963
Impact
Placement changes for the main rotation date in August (1st Wednesday) were not sent to ESR
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- 47000 placements not sent to ESR.
- ESR-ETL (notification daily load) failed.
- The process took a long time, and other jobs that started in the meantime restarted TCS (TCS also ran out of memory).
- 47000 placements processed in one go.
- ESR ETL is not configured to process the placements in batches.
- It wasn't designed that way
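The last two causes point at processing the placements in batches rather than all at once. A minimal sketch of that idea follows, assuming the placements are available as an in-memory list. `PlacementBatcher`, `partition` and the batch size are hypothetical names chosen for illustration and are not taken from the ESR ETL code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the ESR ETL code base.
final class PlacementBatcher {

    // Splits items into consecutive batches of at most batchSize elements each.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

With an assumed batch size of 1000, the 47000 placements would be handled as 47 separate units of work, so a restart part-way through would only need to repeat the batches that had not yet completed. A real implementation would more likely page through the source query than hold every placement in memory.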
Trigger
- .
Resolution
- Route all other traffic through Green.
- Change ESR ETL to accept a parameter that specifies the date for the notification load, so that notifications can be picked up retrospectively if necessary (rather than always using the current system date).
- Kick off the notification load routine on Blue.
- Wait for the process to finish before re-enabling Blue for other traffic.
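The date parameter described in the resolution could look roughly like the sketch below. It is not the actual ESR ETL change: `NotificationLoadDate` and `resolve` are hypothetical names, and passing the date as the first command-line argument in ISO format (yyyy-MM-dd) is an assumption.

```java
import java.time.LocalDate;

// Hypothetical sketch; the real ESR ETL change may be structured differently.
final class NotificationLoadDate {

    // Returns the date to load notifications for: the first argument if one is
    // supplied (assumed ISO format, e.g. 2019-05-07), otherwise the current date.
    static LocalDate resolve(String[] args) {
        if (args.length > 0 && !args[0].trim().isEmpty()) {
            // LocalDate.parse throws DateTimeParseException on malformed input,
            // which fails fast instead of silently falling back to today.
            return LocalDate.parse(args[0].trim());
        }
        return LocalDate.now();
    }
}
```

Called with `2019-05-07`, this returns 7th May 2019 regardless of the day the job actually runs, which is what allows a missed load to be re-run retrospectively.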
Detection / Timeline
- 0200 7th May: ESR ETL started.
- 0600 7th May: ESR ETL found to have failed (at some point)
- 0900 7th May: ESR ETL (notification-daily-load) started again
- 1300 7th May: ESR ETL (notification-daily-load) started again
- 1500 7th May: All jobs disabled from 8pm that evening to give notification-daily-load an opportunity to run
- 2000 7th May: ESR ETL (notification-daily-load) started again (to give the largest window before other processes start)
- 0400 8th May: ESR ETL failed (TCS GC overhead)
- 0900 8th May: Hard memory limit removed from TCS
- 1400 9th May: ESR-ETL code change to allow parameter to be passed to specify the date of the notifications (instead of the default of LocalDate.now())
- 1500 9th May: ESR ETL started
- 1730 10th May: ESR ETL completed
Action Items
- Create a new ticket to address the large data problem by allowing the data to be processed in batches in future: TISNEW-2966
- Note that September is also a busy rotation time for trainees, which will affect the load at the beginning of June
Lessons Learned
- There are certain dates in the year when there will be large amounts of data to send to ESR (3 months before a lot of trainees rotate between jobs)
- Other sync jobs cause TCS to be restarted, which stops any in-progress ETLs
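The three-month lead time in the first lesson can be turned into a simple date calculation for anticipating heavy loads. The sketch below assumes the lead time is exactly three calendar months, which is inferred from this incident rather than confirmed. `RotationLoadDates` and `expectedHeavyLoadDate` are hypothetical names.

```java
import java.time.LocalDate;

// Hypothetical helper for illustration; the exact lead time is an assumption.
final class RotationLoadDates {

    // Assumption: notifications are generated three calendar months ahead
    // of the rotation date.
    static final int LEAD_MONTHS = 3;

    // Returns the date on which a large notification load should be expected
    // for the given rotation date.
    static LocalDate expectedHeavyLoadDate(LocalDate rotationDate) {
        return rotationDate.minusMonths(LEAD_MONTHS);
    }
}
```

For the rotation date of 7th August 2019, `expectedHeavyLoadDate(LocalDate.of(2019, 8, 7))` gives 7th May 2019, the date this incident began.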
What went well
- Teamwork - finding a temporary solution
What went wrong
- Too many other sync jobs getting in the way of a lengthy ETL
- Monitoring was insufficient: we could not see whether the ETL was actually running correctly, and were blind to what was happening until it completed (some 26 hours later)
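One way to reduce that blindness would be for the ETL to report progress at regular intervals while it runs. The sketch below is illustrative only: `EtlProgress`, its methods and the message format are hypothetical, and it assumes the total number of placements is known before processing starts.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical progress tracker; not part of the existing ESR ETL.
final class EtlProgress {
    private final int total;
    private final int logEvery;
    private int processed;

    EtlProgress(int total, int logEvery) {
        if (total <= 0 || logEvery <= 0) {
            throw new IllegalArgumentException("total and logEvery must be positive");
        }
        this.total = total;
        this.logEvery = logEvery;
    }

    // Records one processed placement. Returns a progress line at every
    // logEvery-th item and at completion, otherwise an empty Optional.
    Optional<String> recordOne() {
        processed++;
        if (processed % logEvery == 0 || processed == total) {
            long percent = (long) processed * 100 / total;
            return Optional.of(String.format(Locale.ROOT,
                    "notification-daily-load: %d/%d placements processed (%d%%)",
                    processed, total, percent));
        }
        return Optional.empty();
    }
}
```

The caller would write each returned line to whatever log or alerting channel is already in use. A gap in these messages would then show that the job had stalled or been killed well before the end of a 26-hour run.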
Where we got lucky
Supporting information
- Slack conversation in: fire_fire-2019-05_07