2019-05-21 ESR ETL failed on applicant-export

Date
 
Authors: Joseph (Pepe) Kelly
Status: Missing applicant file uploaded to ESR on 2019-05-22?
Summary

The ESR applicant-export failed on 21st May due to a connection error. The ETL received a 401 (Unauthorized) response, but we don't yet know why.

Impact: New Applicants for Yorkshire & Humberside were not received by ESR until  

Jira reference

TISNEW-3001

Impact

New Applicants for Yorkshire & Humberside were not received by ESR until  

Details of the scheduled jobs are here: ESR Schedules.

Root Causes

  • A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
  • The second run of the job didn't complete before the FTP sync initiated.

Trigger

  • Authentication failure of the request from the ESR-ETL to TCS.

Resolution

  • Check which files (if any) weren't sent to ESR by comparing the export ETLs and the FTP Sync job in the #esr_operations channel in Slack.
  • Moved the missing file to the 'outbound' folder in Azure for today (22nd), ready to be processed at 18:00.
  • Validate that the file was processed by ESR the next day.
  • Add application-level retries to cope with temporary connectivity issues.

Detection / Timeline

  • 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
  • 2019-05-21 1745 (approx.): Message from Ansible picked up and job run manually via Jenkins.
  • 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
  • 2019-05-22 1130: Investigation started. Found that 1 file had been placed in Azure after the FTP sync ran, and that the last file uploaded was processed by ESR.
  • 2019-05-22 1458: Copied file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
  • 2019-05-23 0807: Checked that the file was processed by ESR. An email in #esr_emails confirms the file contents.

Action Items

  • Raise ticket to include retries (as one type of service resilience) for connection issues, e.g. a configurable list of HTTP status codes
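
As a starting point for that ticket, the retry idea could be sketched roughly as below. This is a minimal illustration and not the ESR-ETL's actual code: the class name RetryOnStatus, its constructor parameters, the linear backoff and the example status codes are all assumptions. In a Spring application a library such as Spring Retry might be a better fit than a hand-rolled helper.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntSupplier;

/**
 * Hypothetical helper (not part of ESR-ETL): repeats a call while it
 * returns one of a configurable set of HTTP status codes.
 */
final class RetryOnStatus {

    private final Set<Integer> retryableStatuses; // assumed to come from configuration
    private final int maxAttempts;                // total attempts, including the first
    private final long backoffMillis;             // base delay between attempts

    RetryOnStatus(Set<Integer> retryableStatuses, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        if (backoffMillis < 0) {
            throw new IllegalArgumentException("backoffMillis must not be negative");
        }
        this.retryableStatuses = new HashSet<>(retryableStatuses);
        this.maxAttempts = maxAttempts;
        this.backoffMillis = backoffMillis;
    }

    /**
     * Invokes the call until it returns a status that is not retryable or the
     * attempts are used up. Returns the status of the last attempt made.
     */
    int execute(IntSupplier call) {
        int status = call.getAsInt();
        for (int attempt = 1;
             attempt < maxAttempts && retryableStatuses.contains(status);
             attempt++) {
            try {
                // Linear backoff: wait a little longer before each new attempt.
                Thread.sleep(backoffMillis * attempt);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                return status;
            }
            status = call.getAsInt();
        }
        return status;
    }
}
```

The supplier would wrap the real HTTP call and return its status code. Whether 401 belongs in the retryable set is a judgement call: it would have covered this incident, but it could also mask a genuine credentials problem, so the list is best kept short and the attempt count low.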

Lessons Learned

  • 401 (Unauthorized) HTTP Responses are not always due to the service or profile being down/restarting.
  • Slack notifications can be configured for keywords.

What went well

  • It was simple to rectify the issue with this particular instance.

What went wrong

  • A transient connection problem caused the job to fail.
  • The jobs are scheduled very close together. The restarted job didn't complete before the next in the chain ran.

Where we got lucky

Supporting information

ESR log:

2019-05-21 16:00:39.199 ERROR 1 --- [main] o.s.boot.SpringApplication : Application startup failed

java.lang.IllegalStateException: Failed to execute ApplicationRunner
...
at com.transformuk.hee.tis.esr.Application.main(Application.java:49) 1.0.27
...

Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://tcs:8093/tcs/api/placements/filter": {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}; nested exception is java.io.IOException: {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}

TCS and Profile logs don't show