2019-05-21 ESR ETL failed on applicant-export
Date | 2019-05-21 |
Authors | Joseph (Pepe) Kelly |
Status | Missing applicant file uploaded to ESR on 2019-05-22? |
Summary | ESR applicant-export failed on 21st May due to a connection error. The ETL received a 401 (Unauthorized) exception but we don't yet know why. |
Impact | New Applicants for Yorkshire & Humberside were not received by ESR until |
Jira reference
- TISNEW-3001
Impact
New Applicants for Yorkshire & Humberside were not received by ESR until
Details of the scheduled jobs are here: ESR Schedules.
Root Causes
- A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
- The second run of the job didn't complete before the FTP sync initiated.
Trigger
- Authentication failure of the request from the ESR-ETL to TCS.
Resolution
- Check which files (if any) weren't sent to ESR by comparing the export ETLs and the FTP Sync job in the #esr_operations channel in Slack.
- Moved the missing file to the 'outbound' folder in Azure for today (22nd), ready to be processed at 18:00.
- Validate the file was processed by ESR the next day.
- Add application level retries to cope with temporary connectivity issues.
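The file comparison in the first step amounts to a set difference between what the export ETLs produced and what the FTP Sync picked up. A minimal sketch, assuming both lists of file names have already been gathered by hand from the job output; the class name, method name and file names are hypothetical and not part of the ESR-ETL code base:

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for checking which exported files the FTP Sync missed. */
final class MissingFiles {

  private MissingFiles() {
  }

  /** Returns the exported file names that are absent from the synced set, sorted by name. */
  static Set<String> notSynced(Set<String> exported, Set<String> synced) {
    Set<String> missing = new TreeSet<>(exported); // copy so the input is not modified
    missing.removeAll(synced);
    return missing;
  }
}
```

An empty result means every exported file was picked up; anything returned would need moving to the next day's 'outbound' folder.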
Detection / Timeline
- 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
- 2019-05-21 1745 (approx.): Message from Ansible picked up and job run manually via Jenkins.
- 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
- 2019-05-22 1130: Investigation started. Found that there was 1 file placed in Azure after the FTP sync ran and the last file uploaded was processed by ESR.
- 2019-05-22 1458: Copied file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
- 2019-05-23 0807: Checked that file was processed by ESR. Email in #esr_emails confirms file contents.
Action Items
- Raise ticket to include retries (as one type of service resilience) for connection issues, e.g. retrying on a configurable list of HTTP status codes.
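One possible shape for such a retry is sketched below. It is not the ESR-ETL implementation: the class name, the method signature and the linear backoff are assumptions made for illustration. The call is represented as a supplier of an HTTP status code so that the sketch does not depend on any particular HTTP client.

```java
import java.util.Set;
import java.util.function.IntSupplier;

/** Hypothetical helper: repeats a call while it returns a retryable HTTP status. */
final class StatusRetry {

  private StatusRetry() {
  }

  /**
   * Invokes the call at most maxAttempts times and returns the last status seen.
   * A retry only happens when the status is in the configured retryable set.
   */
  static int callWithRetry(IntSupplier call, Set<Integer> retryableStatuses,
      int maxAttempts, long backoffMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    int status = call.getAsInt();
    for (int attempt = 1;
        attempt < maxAttempts && retryableStatuses.contains(status);
        attempt++) {
      try {
        Thread.sleep(backoffMillis * attempt); // linear backoff: 1x, 2x, 3x, ...
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        throw new IllegalStateException("Interrupted while waiting to retry", e);
      }
      status = call.getAsInt();
    }
    return status;
  }
}
```

With a retryable set such as 401 and 503, a transient 401 like the one in this incident would be retried, while a status outside the set (for example 404) would be returned immediately. Whether 401 belongs in that set is a decision for the ticket, since a persistent authentication failure would only be delayed by retrying.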
Lessons Learned
- 401 (Unauthorized) HTTP Responses are not always due to the service or profile being down/restarting.
- Slack notifications can be configured for keywords
What went well
- It was simple to rectify the issue with this particular instance.
What went wrong
- A transient connection problem caused the job to fail.
- The jobs are very close to each other. The restarted job didn't complete before the next in the chain ran.
Where we got lucky
Supporting information
ESR log:
2019-05-21 16:00:39.199 ERROR 1 --- [main] o.s.boot.SpringApplication : Application startup failed
java.lang.IllegalStateException: Failed to execute ApplicationRunner
...
at com.transformuk.hee.tis.esr.Application.main(Application.java:49) 1.0.27
...
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://tcs:8093/tcs/api/placements/filter": {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}; nested exception is java.io.IOException: {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}
TCS and Profile logs don't show
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213