Date | |
Authors | Joseph (Pepe) Kelly |
Status | Missing applicant file uploaded to ESR on 2019-05-22? |
Summary | ESR applicant-export failed on 21st May due to a connection error. The ETL received a 401 (Unauthorized) exception but we don't yet know why. |
Impact | New Applicants for Yorkshire & Humberside were not received by ESR until |
Jira reference
- TISNEW-2963
Impact
New Applicants for Yorkshire & Humberside were not received by ESR until
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
- The job didn't complete before the FTP sync initiated.
Trigger
- Authentication failure of the request from the ESR-ETL to TCS.
Resolution
- Check which files (if any) weren't sent to ESR by comparing the export ETLs against the FTP Sync job in the #esr_operations channel in Slack.
- Moved the missing file to the 'outbound' folder in Azure for today (22nd), ready to be processed at 18:00.
- Validate the file was processed by ESR the next day.
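The comparison in the first step can be sketched as a set difference, assuming the file names reported by the export ETLs and the file names picked up by the FTP Sync job are both available as plain lists. The function name and the sample file names are hypothetical and are not taken from the ESR-ETL code.

```python
def find_unsent_files(exported, synced):
    """Return the exported file names that the FTP sync did not pick up, sorted."""
    return sorted(set(exported) - set(synced))


# Hypothetical file names, for illustration only.
exported = ["applicants_a.dat", "applicants_b.dat", "applicants_c.dat"]
synced = ["applicants_a.dat", "applicants_c.dat"]

for name in find_unsent_files(exported, synced):
    print(f"Not sent to ESR: {name}")
```

Any name printed here would be a candidate for copying into the next day's 'outbound' folder.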
Detection / Timeline
- 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
- 2019-05-21 1745 (approx.): Message from Ansible picked up and job run manually via Jenkins.
- 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
- 2019-05-22 1130: Investigation started. Found that there was 1 file placed in Azure after the FTP sync ran.
- 2019-05-22 1458: Copied file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
- 2019-05-23 ????: Checked that file was processed by ESR
Action Items
- Create new ticket to address the large data problem by allowing the data to be run in batches in future: TISNEW-2966
- Note that September is also a busy rotation time for trainees which will affect the load at the beginning of June
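One possible shape for the batching proposed in the first action item is to split the records into fixed-size chunks and export each chunk separately. This is a minimal sketch under that assumption, not the ESR-ETL implementation. The helper name `batched` and the batch size are hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(records: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


# Hypothetical usage: process five records two at a time.
for chunk in batched(["r1", "r2", "r3", "r4", "r5"], 2):
    print(f"Exporting {len(chunk)} record(s): {chunk}")
```

Smaller chunks would mean less work is lost if the run is interrupted partway through, at the cost of more requests.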
Lessons Learned
- There are certain dates in the year when there will be large amounts of data to send to ESR (3 months before a lot of trainees rotate between jobs).
- Other sync jobs will cause TCS to be restarted, which will stop any in-progress ETLs.
What went well
- Teamwork - finding a temporary solution
What went wrong
- Too many other sync jobs getting in the way of a lengthy ETL
- Monitoring insufficient - we weren't able to see if the ETL was actually running correctly. We were blind to what was happening until it was complete (some 26 hours later)
Where we got lucky
Supporting information
- Slack conversation in: fire_fire-2019-05_07