...
Impact
New Applicants for Yorkshire & Humberside were not received by ESR until the following day.
...
- A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
- The second run of the job didn't complete before the FTP sync initiated.
...
- Check which files (if any) weren't sent to ESR (by comparing the export ETLs with the FTP Sync job in the #esr_operations channel in Slack).
- Move the missing file to the 'outbound' folder in Azure for today (22nd), ready to be processed at 18:00.
- Validate that the file was processed by ESR the next day.
- Add application-level retries to cope with temporary connectivity issues.
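A minimal sketch of what such an application-level retry might look like (plain Java, no framework; the class name, attempt count, back-off and status list are illustrative assumptions, not the actual TIS code — 401 is included only because, in this incident, TCS briefly returned 401 while restarting):

```java
import java.util.Set;
import java.util.concurrent.Callable;

/** Illustrative sketch: retry a call that fails with a transient HTTP status. */
public class HttpRetry {

    /** Statuses treated as transient; in a real implementation this list
     *  would be configurable (see the action item below). */
    private static final Set<Integer> RETRYABLE_STATUSES = Set.of(401, 502, 503, 504);

    /** Exception carrying the HTTP status of a failed request. */
    public static class HttpStatusException extends Exception {
        public final int status;
        public HttpStatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Runs the call, retrying up to maxAttempts with linear back-off
     *  when the failure status is in the retryable list. */
    public static <T> T withRetries(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (HttpStatusException e) {
                if (attempt >= maxAttempts || !RETRYABLE_STATUSES.contains(e.status)) {
                    throw e; // non-retryable status, or out of attempts
                }
                Thread.sleep(backoffMillis * attempt); // back off before the next try
            }
        }
    }
}
```

With a sketch like this, a 401 raised during a TCS restart would be retried a few seconds later instead of failing the whole ETL run.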
Detection / Timeline
- 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
- 2019-05-21 1745 (approx.): Message from Ansible picked up and the job run manually via Jenkins.
- 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
- 2019-05-22 1130: Investigation started. Found that one file had been placed in Azure after the FTP sync ran; the last file that was uploaded had been processed by ESR.
- 2019-05-22 1458: Copied the file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
- 2019-05-23 0807: Checked that the file was processed by ESR. Email in #esr_emails confirms the file contents.
Action Items
- Create a new ticket to address the large-data problem: allow the data to be run in batches in future (Jira: TISNEW-2966).
- Note that September is also a busy rotation time for trainees, which will affect the load at the beginning of June.
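The batching idea in the action item above could look something like this sketch (class and parameter names are illustrative assumptions, not the actual TIS code): the export pulls records one fixed-size page at a time, so one very long run becomes many small, restartable chunks.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Illustrative sketch: process a large export in fixed-size pages. */
public class BatchedExport {

    /**
     * Repeatedly fetches page 0, 1, 2, ... and hands each non-empty batch
     * to the sender (e.g. a writer to the outbound folder). Stops when a
     * page comes back empty or short. Returns the total records exported.
     */
    public static <T> int exportInBatches(Function<Integer, List<T>> fetchPage,
                                          Consumer<List<T>> send,
                                          int pageSize) {
        int page = 0;
        int total = 0;
        while (true) {
            List<T> batch = fetchPage.apply(page++);
            if (batch.isEmpty()) {
                break;                        // no more records to export
            }
            send.accept(batch);               // ship this chunk
            total += batch.size();
            if (batch.size() < pageSize) {
                break;                        // short page means we reached the end
            }
        }
        return total;
    }
}
```

Failure mid-run then only loses the current page rather than the whole export, which also shortens the window in which a TCS restart can kill an in-progress ETL.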
Lessons Learned
- There are certain dates in the year when there will be large amounts of data to send to ESR (three months before many trainees rotate between jobs).
- Other sync jobs can cause TCS to be restarted, which will stop any in-progress ETLs etc.
What went well
- Teamwork - finding a temporary solution
What went wrong
- Too many other sync jobs getting in the way of a lengthy ETL
- Monitoring insufficient - we weren't able to see whether the ETL was actually running correctly. We were blind to what was happening until it was complete (some 26 hours later).
Action Items
- Raise a ticket to include retries (as one type of service resilience) for connection issues, e.g. for a configurable list of HTTP status codes.
Lessons Learned
- 401 (Unauthorized) HTTP responses are not always due to the service or profile being down/restarting.
- Slack notifications can be configured for keywords
What went well
- It was simple to rectify the issue with this particular instance.
What went wrong
- A transient connection problem caused the job to fail.
- The jobs are scheduled very close to each other; the restarted job didn't complete before the next in the chain ran.
Where we got lucky
Supporting information
...
ESR log:
2019-05-21 16:00:39.199 ERROR 1 --- [main] o.s.boot.SpringApplication : Application startup failed
java.lang.IllegalStateException: Failed to execute ApplicationRunner
...
at com.transformuk.hee.tis.esr.Application.main(Application.java:49) 1.0.27
...
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://tcs:8093/tcs/api/placements/filter": {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}; nested exception is java.io.IOException: {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}
TCS and Profile logs don't show a corresponding error or restart at the time of the 401.