Date
 
Authors: Joseph (Pepe) Kelly
Status: Missing applicant file uploaded to ESR on 2019-05-22?
Summary

The ESR applicant-export failed on 21st May due to a connection error. The ETL received a 401 (Unauthorized) response, but we don't yet know why.

Impact: New Applicants for Yorkshire & Humberside were not received by ESR until  

Jira reference

TISNEW-2963

Impact

New Applicants for Yorkshire & Humberside were not received by ESR until  

Details of the scheduled jobs are here: ESR Schedules.

Root Causes

  • A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
  • The job didn't complete before the FTP sync initiated.

Trigger

  • Authentication failure of the request from the ESR-ETL to TCS.

Resolution

  • Checked which files (if any) weren't sent to ESR by comparing the export ETLs with the FTP Sync job in the #esr_operations channel in Slack.
  • Moved the missing file to the 'outbound' folder in Azure for today (22nd), ready to be processed at 18:00.
  • Validated that the file was processed by ESR the next day.
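
The comparison in the first step can be sketched as a set difference between the files the export ETLs produced and the files the FTP Sync reported picking up. This is a minimal sketch only: the function name and the file names are hypothetical, and it assumes both lists have already been collected by hand (for example from the #esr_operations messages).

```python
def find_unsent_files(exported, synced):
    """Return the exported file names that the FTP sync did not pick up."""
    # Set difference: everything exported minus everything synced.
    return sorted(set(exported) - set(synced))


# Hypothetical file names, for illustration only.
exported = ["applicants_a.dat", "applicants_b.dat", "applicants_c.dat"]
synced = ["applicants_a.dat", "applicants_c.dat"]

for name in find_unsent_files(exported, synced):
    print("not sent to ESR:", name)
```

Any file the function returns is a candidate for moving into the next day's 'outbound' folder.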

Detection / Timeline

  • 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
  • 2019-05-21 1745 (approx.): Message from Ansible picked up and job run manually via Jenkins.
  • 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
  • 2019-05-22 1130: Investigation started. Found that there was 1 file placed in Azure after the FTP sync ran.
  • 2019-05-22 1458: Copied file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
  • 2019-05-23 ????: Checked that file was processed by ESR
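
The finding on 2019-05-22 (one file placed in Azure after the FTP sync ran) amounts to comparing each file's arrival time against the sync start time. The sketch below assumes the arrival times are already available as a simple mapping; the function name, file names and timestamps are hypothetical, and it does not call any Azure API.

```python
from datetime import datetime


def files_after_cutoff(file_times, cutoff):
    """Return names of files that arrived after the sync cut-off time."""
    return sorted(name for name, arrived in file_times.items() if arrived > cutoff)


# The FTP Sync ran at 18:00 on 2019-05-21 (from the timeline above).
sync_started = datetime(2019, 5, 21, 18, 0)

# Hypothetical arrival times, for illustration only.
arrivals = {
    "applicants_a.dat": datetime(2019, 5, 21, 17, 50),
    "applicants_b.dat": datetime(2019, 5, 21, 18, 5),
}

print(files_after_cutoff(arrivals, sync_started))
```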



Action Items

  • Create a new ticket to address the large data problem by allowing the data to be run in batches in future (TISNEW-2966).
  • Note that September is also a busy rotation time for trainees, which will affect the load at the beginning of June.
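
As an illustration of what running the data in batches could look like, the sketch below splits any sequence of records into fixed-size chunks. It is not the design for TISNEW-2966: the function name and batch size are assumptions, and how the ETL would actually send each batch is left out.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage: process 5 records in batches of 2.
for batch in batched(range(5), 2):
    print(len(batch), "records in this batch")
```

Smaller batches mean that an interruption (such as a TCS restart) costs at most one batch of work rather than the whole run.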

Lessons Learned

  • There are certain dates in the year when there will be large amounts of data to send to ESR (3 months before a lot of trainees rotate between jobs)
  • Other sync jobs cause TCS to be restarted, which stops any in-progress ETLs etc.

What went well

  • Teamwork - finding a temporary solution

What went wrong

  • Too many other sync jobs getting in the way of a lengthy ETL
  • Monitoring was insufficient: we weren't able to see whether the ETL was actually running correctly. We were blind to what was happening until it completed (some 26 hours later).
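
One low-cost way to reduce that blindness would be for a long-running job to log its progress at regular intervals. The following is a sketch under assumptions, not a description of the ESR-ETL: the function name, the logger name "etl" and the reporting interval are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)


def process_with_progress(records, handle, every=1000, log=None):
    """Apply handle to each record, logging a progress line every `every` records."""
    log = log or logging.getLogger("etl")
    count = 0
    for record in records:
        handle(record)
        count += 1
        if count % every == 0:
            log.info("processed %d records so far", count)
    log.info("finished: %d records processed", count)
    return count


# Hypothetical usage: a no-op handler over 5 records, reporting every 2.
process_with_progress(range(5), lambda record: None, every=2)
```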

Where we got lucky

Supporting information
