2019-05-07 ESR ETL failed on notifications-daily-load
Date | 2019-05-07 |
Authors | Simon Meredith (Unlicensed) |
Status | Data uploaded to ESR on 2019-05-11 |
Summary | ESR notifications-daily-load failed on 7th May due to the size of the data (relating to the major rotation date of 7th August). TCS hit an out-of-memory (OOM) error, which killed the process. The process was restarted but then ran so long that other scheduled jobs killed it again. |
Impact | Notifications (changes to placements) were not produced and were not available to ESR until 11th May |
Jira reference
- TISNEW-2963
Impact
Placement changes for the main rotation date in August (1st Wednesday) were not sent to ESR
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- 47,000 placements not sent to ESR.
- ESR-ETL (notification daily load) failed.
- The process took a long time and other scheduled jobs started, which restarted TCS (and TCS hit an OOM error).
- 47,000 placements were processed in one go.
- ESR ETL is not configured to process the placements in batches.
- It was not designed that way.
Trigger
- The unusually large volume of placement changes relating to the 7th August rotation date hit the notifications-daily-load run on 7th May.
Resolution
- Route all other traffic through Green.
- Change ESR ETL to allow a parameter to be passed which provides a specific date for the notification load, so that notifications can be picked up retrospectively if necessary (rather than just using the current system date); see the sketch after this list.
- Kick off the notification load routine on Blue.
- Wait for the process to finish before re-enabling Blue for other traffic.
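The actual change lives in the ESR-ETL codebase; the sketch below only illustrates the idea, and the class and method names are hypothetical. The load accepts an optional date and falls back to LocalDate.now() when none is supplied, so a missed day can be re-run retrospectively.

```java
import java.time.LocalDate;

// Minimal sketch (hypothetical names) of the date-parameter change:
// the notification load previously always used LocalDate.now(); an optional
// date now lets a missed day's notifications be generated retrospectively.
public class NotificationsDailyLoad {

    public void run(LocalDate requestedDate) {
        // Fall back to the current system date to preserve the old behaviour.
        LocalDate notificationDate = (requestedDate != null) ? requestedDate : LocalDate.now();
        loadNotificationsFor(notificationDate);
    }

    private void loadNotificationsFor(LocalDate notificationDate) {
        // Query placement changes effective for notificationDate and build the
        // notification records (omitted here).
    }
}
```

Re-running the missed load for 7th May is then a matter of invoking the job with `LocalDate.of(2019, 5, 7)` instead of waiting for the next scheduled run.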
Detection / Timeline
- 06:00 7th May: ESR ETL started.
- 06:00 7th May: ESR ETL run found to have failed (at some point).
- 09:00 7th May: ESR ETL (notification-daily-load) started again.
- 10:39 7th May: ESR ETL run found to have failed (at some point).
- 13:00 7th May: ESR ETL (notification-daily-load) started again.
- 15:00 7th May: All jobs disabled from 8pm that evening to give notification-daily-load an opportunity to run.
- 20:00 7th May: ESR ETL (notification-daily-load) started again (to give the largest window before other processes start).
- 04:00 8th May: ESR ETL failed (TCS GC overhead).
- 09:00 8th May: Hard memory limit removed from TCS.
- 14:00 9th May: ESR-ETL code change to allow a parameter to be passed to specify the date of the notifications (instead of the default of LocalDate.now()).
- 15:00 9th May: ESR ETL started.
- 17:30 10th May: ESR ETL completed.
- 19:45 10th May: Notifications-Daily-Export ran to produce the CSV files. Since the 18:00 upload job was missed, the files needed to be copied into the following day's Azure Outbound folder to be picked up by the next run (see the sketch after this timeline).
- 12:44 11th May: 14 CSV files moved to the 2019-05-11 Outbound folder.
- 18:00 11th May: The 14 CSV files, along with the BAU files produced by the ESR-ETL, uploaded to ESR.
- 18:06 11th May: Email confirmation from ESR of those files having been picked up.
- 09:30 13th May: Identified 9 more files produced by the run on 11th May at 12:44. Also counted the records across the 23 files (approximately 17k notifications). Worth noting that the 47k placements reduce to less than half that: the initial count included both current and next placements, along with some duplicate notification type 1 records, which are then grouped to work out the final set of notifications to be built into the files.
- 01:57 13th May: Additional 9 files copied to the 13th's Outbound folder to be uploaded at 18:00.
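For reference, the manual recovery at 12:44 amounted to moving the late CSV files into the next day's Outbound folder so the scheduled 18:00 upload would pick them up. A rough sketch is below, assuming the Outbound folders are reachable as a mounted file path; the real folders sit in Azure storage and the paths and dates shown are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Illustrative only: move the late CSV files into the following day's Outbound
// folder so the scheduled 18:00 upload picks them up. The paths and dates are
// assumptions; the real Outbound folders live in Azure storage.
public class MoveCsvToNextDayOutbound {

    public static void main(String[] args) throws IOException {
        Path source = Paths.get("/mnt/esr/Outbound/2019-05-10");
        Path target = Paths.get("/mnt/esr/Outbound/2019-05-11");
        Files.createDirectories(target);

        try (Stream<Path> files = Files.list(source)) {
            files.filter(p -> p.getFileName().toString().endsWith(".csv"))
                 .forEach(p -> {
                     try {
                         // Keep the original file name under the new day's folder.
                         Files.move(p, target.resolve(p.getFileName()),
                                 StandardCopyOption.REPLACE_EXISTING);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```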
Action Items
- Create a new ticket to address the large data problem - allow the data to be run in batches in future (TISNEW-2966); see the sketch after this list.
- Note that September is also a busy rotation time for trainees, which will affect the load at the beginning of June.
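A rough sketch of the batching approach that TISNEW-2966 is intended to cover is shown below, assuming placement changes can be fetched from TCS page by page. All class and method names here are hypothetical, not the actual ESR-ETL API.

```java
import java.time.LocalDate;
import java.util.List;

// Hypothetical sketch of batch processing for the notification load:
// fetch and handle placements in fixed-size pages instead of one ~47,000 row pass.
public class BatchedNotificationLoad {

    private static final int BATCH_SIZE = 1000;

    private final PlacementClient placementClient;       // pages placement changes out of TCS
    private final NotificationWriter notificationWriter; // builds/stores the ESR notifications

    public BatchedNotificationLoad(PlacementClient placementClient,
                                   NotificationWriter notificationWriter) {
        this.placementClient = placementClient;
        this.notificationWriter = notificationWriter;
    }

    public void run(LocalDate notificationDate) {
        int page = 0;
        List<Placement> batch;
        do {
            // One page at a time keeps the working set small and bounded.
            batch = placementClient.getChangedPlacements(notificationDate, page, BATCH_SIZE);
            notificationWriter.writeNotifications(batch);
            page++;
        } while (batch.size() == BATCH_SIZE);
    }

    // Minimal types to keep the sketch self-contained.
    interface PlacementClient {
        List<Placement> getChangedPlacements(LocalDate date, int page, int size);
    }

    interface NotificationWriter {
        void writeNotifications(List<Placement> placements);
    }

    static class Placement { }
}
```

Processing in bounded pages keeps memory usage roughly constant regardless of how many placements a rotation date generates, so a large August or September rotation should no longer push TCS into an OOM.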
Lessons Learned
- There are certain dates in the year when there will be large amounts of data to send to ESR (3 months before a lot of trainees rotate between jobs)
- Other sync jobs will cause TCS to be restarted, which will stop any in-progress ETLs.
What went well
- Teamwork - finding a temporary solution
What went wrong
- Too many other sync jobs getting in the way of a lengthy ETL
- Monitoring was insufficient - we weren't able to see whether the ETL was actually running correctly. We were blind to what was happening until it completed (some 26 hours later).
Where we got lucky
Supporting information
- Slack conversation in: fire_fire-2019-05_07
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213