Jira: TISNEW-29633001 (System JIRA)

Impact

New Applicants for Yorkshire & Humberside were not received by ESR until  

...

  • A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
  • The second run of the job didn't complete before the FTP sync initiated.

...

  • Check which files (if any) weren't sent to ESR by comparing the export ETL and FTP Sync job messages in the #esr_operations channel in Slack.
  • Moved missing file to the 'outbound' folder in Azure for today (22nd) ready to be processed at 18:00.
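The comparison between exported files and synced files amounts to a set difference. The following is a minimal sketch only: the function name and the file names in the usage example are hypothetical, and in this incident the two lists were read by hand from the job messages in #esr_operations rather than produced by a script.

```python
from typing import Iterable, List


def files_not_sent(exported: Iterable[str], synced: Iterable[str]) -> List[str]:
    """Return the exported file names that the FTP sync did not pick up.

    Both arguments are plain collections of file names; how they are
    gathered (Slack messages, Azure listing, job logs) is out of scope.
    """
    return sorted(set(exported) - set(synced))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    exported_files = ["export_a.dat", "export_b.dat", "export_c.dat"]
    synced_files = ["export_a.dat", "export_c.dat"]
    print(files_not_sent(exported_files, synced_files))
```

Any name this returns is a candidate for copying into the current day's 'outbound' folder.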

  • Validate the file was processed by ESR the next day.
  • Add application-level retries to cope with temporary connectivity issues.
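One possible shape for application-level retries, driven by a configurable list of HTTP status codes, is sketched below. It is an illustration, not the ESR-ETL implementation: the function name, the default status codes, the attempt count and the back-off delays are all assumptions, and the real service is a Java application, so this only shows the idea in a language-neutral way.

```python
import time
from typing import Callable, Iterable, Tuple


def call_with_retries(
    request_fn: Callable[[], Tuple[int, str]],
    retry_statuses: Iterable[int] = (401, 502, 503, 504),  # assumed defaults
    max_attempts: int = 3,                                  # assumed default
    base_delay_seconds: float = 1.0,                        # assumed default
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call request_fn until it returns a non-retryable status.

    request_fn performs one HTTP request and returns (status, body).
    The delay doubles after each failed attempt. The last response is
    returned even if it is still retryable, so the caller can fail loudly.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    retryable = set(retry_statuses)
    attempt = 1
    while True:
        status, body = request_fn()
        if status not in retryable or attempt == max_attempts:
            return status, body
        sleep(base_delay_seconds * 2 ** (attempt - 1))
        attempt += 1
```

Injecting `sleep` keeps the sketch testable without real waiting. Treating 401 as retryable is only reasonable here because the incident showed that a 401 can be transient; in general it usually signals a genuine credentials problem.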

Detection / Timeline

  • 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
  • 2019-05-21 1745 (approx.): Message from Ansible picked up and job run manually via Jenkins.
  • 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
  • 2019-05-22 1130: Investigation started. Found that there was 1 file placed in Azure after the FTP sync ran and the last file uploaded was processed by ESR.
  • 2019-05-22 1458: Copied file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
  • 2019-05-23 0807: Checked that file was processed by ESR. Email in #esr_emails confirms file contents.

Action Items

  • Create new ticket to address the large data problem - allow the data to be run in batches in future (Jira: TISNEW-2966)
  • Note that September is also a busy rotation time for trainees which will affect the load at the beginning of June

Lessons Learned

  • There are certain dates in the year when there will be large amounts of data to send to ESR (3 months before a lot of trainees rotate between jobs)
  • Other sync jobs cause TCS to be restarted, which stops any in-progress ETLs

What went well

  • Teamwork - finding a temporary solution

What went wrong

  • Too many other sync jobs getting in the way of a lengthy ETL
  • Monitoring was insufficient - we weren't able to see if the ETL was actually running correctly. We were blind to what was happening until it was complete (some 26 hours later)
  • Raise ticket to include retries (as one type of service resilience) for connection issues, e.g. a configurable list of HTTP status codes

Lessons Learned

  • 401 (Unauthorized) HTTP Responses are not always due to the service or profile being down/restarting.
  • Slack notifications can be configured for keywords

What went well

  • It was simple to rectify the issue with this particular instance.

What went wrong

  • A transient connection problem caused the job to fail.
  • The jobs are very close to each other. The restarted job didn't complete before the next in the chain ran.
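The timing problem can be expressed as a simple check before a manual re-run: will the job finish before the next job in the chain starts? The sketch below is a hypothetical helper, not something that exists in the pipeline. The 17:45 re-run and 18:00 FTP sync times come from the timeline; the 20-minute job duration in the usage example is an assumed figure, since the actual run time is not recorded here.

```python
from datetime import datetime, timedelta


def finishes_before(
    start: datetime,
    expected_duration: timedelta,
    next_job_start: datetime,
    safety_margin: timedelta = timedelta(0),
) -> bool:
    """Return True if a job started at `start` should be finished,
    with the given safety margin, before the next job begins."""
    return start + expected_duration + safety_margin <= next_job_start


if __name__ == "__main__":
    rerun_start = datetime(2019, 5, 21, 17, 45)  # manual re-run (approx.)
    ftp_sync_start = datetime(2019, 5, 21, 18, 0)  # scheduled FTP sync
    assumed_duration = timedelta(minutes=20)  # assumption, not a measured value
    print(finishes_before(rerun_start, assumed_duration, ftp_sync_start))
```

If the check fails, the options are to delay the downstream job or to accept that some files will go out in the following day's sync.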

Where we got lucky

Supporting information

...

ESR log:

2019-05-21 16:00:39.199 ERROR 1 --- main o.s.boot.SpringApplication : Application startup failed

java.lang.IllegalStateException: Failed to execute ApplicationRunner
...
at com.transformuk.hee.tis.esr.Application.main(Application.java:49) 1.0.27
...

Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://tcs:8093/tcs/api/placements/filter": {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}; nested exception is java.io.IOException: {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}

TCS and Profile logs don't show