...
Impact
New Applicants for Yorkshire & Humberside were not received by ESR until the following day.
...
- A 401 (Unauthorized) response was received for an HTTP request from ESR-ETL to TCS.
- The second run of the job didn't complete before the FTP sync initiated.
...
- Check which files (if any) weren't sent to ESR (by comparing the export ETLs with the FTP Sync job in the #esr_operations channel in Slack).
- Move the missing file to the 'outbound' folder in Azure for today (22nd), ready to be processed at 18:00.
- Validate that the file was processed by ESR the next day.
- Add application-level retries to cope with temporary connectivity issues.
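A minimal sketch of what such an application-level retry might look like (plain Java, no framework; the class name, attempt count, back-off and status list are illustrative assumptions, not the actual TIS code — 401 is included only because, in this incident, TCS briefly returned 401 while restarting):

```java
import java.util.Set;
import java.util.concurrent.Callable;

/** Illustrative sketch: retry a call that fails with a transient HTTP status. */
public class HttpRetry {

    /** Statuses treated as transient; in a real implementation this list
     *  would be configurable (see the action item below). */
    private static final Set<Integer> RETRYABLE_STATUSES = Set.of(401, 502, 503, 504);

    /** Exception carrying the HTTP status of a failed request. */
    public static class HttpStatusException extends Exception {
        public final int status;
        public HttpStatusException(int status) {
            super("HTTP " + status);
            this.status = status;
        }
    }

    /** Runs the call, retrying up to maxAttempts with linear back-off
     *  when the failure status is in the retryable list. */
    public static <T> T withRetries(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (HttpStatusException e) {
                if (attempt >= maxAttempts || !RETRYABLE_STATUSES.contains(e.status)) {
                    throw e; // non-retryable status, or out of attempts
                }
                Thread.sleep(backoffMillis * attempt); // back off before the next try
            }
        }
    }
}
```

With a sketch like this, a 401 raised during a TCS restart would be retried a few seconds later instead of failing the whole ETL run.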
Detection / Timeline
- 2019-05-21 1700: Ansible message to #esr_operations channel reporting failure.
- 2019-05-21 1745 (approx.): Message from Ansible picked up and the job run manually via Jenkins.
- 2019-05-21 1800: FTP Sync runs and picks up all but 1 file.
- 2019-05-22 1130: Investigation started. Found that one file had been placed in Azure after the FTP sync ran; the last file that was uploaded had been processed by ESR.
- 2019-05-22 1458: Copied the file from yesterday's outbound folder (2019-05-21) to today's outbound folder (2019-05-22).
- 2019-05-23 0807: Checked that the file was processed by ESR. Email in #esr_emails confirms the file contents.
Action Items
- Create a new ticket to address the large-data problem: allow the data to be run in batches in future (Jira: TISNEW-2966).
- Note that September is also a busy rotation time for trainees, which will affect the load at the beginning of June.
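The batching idea in the action item above could look something like this sketch (class and parameter names are illustrative assumptions, not the actual TIS code): the export pulls records one fixed-size page at a time, so one very long run becomes many small, restartable chunks.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Illustrative sketch: process a large export in fixed-size pages. */
public class BatchedExport {

    /**
     * Repeatedly fetches page 0, 1, 2, ... and hands each non-empty batch
     * to the sender (e.g. a writer to the outbound folder). Stops when a
     * page comes back empty or short. Returns the total records exported.
     */
    public static <T> int exportInBatches(Function<Integer, List<T>> fetchPage,
                                          Consumer<List<T>> send,
                                          int pageSize) {
        int page = 0;
        int total = 0;
        while (true) {
            List<T> batch = fetchPage.apply(page++);
            if (batch.isEmpty()) {
                break;                        // no more records to export
            }
            send.accept(batch);               // ship this chunk
            total += batch.size();
            if (batch.size() < pageSize) {
                break;                        // short page means we reached the end
            }
        }
        return total;
    }
}
```

Failure mid-run then only loses the current page rather than the whole export, which also shortens the window in which a TCS restart can kill an in-progress ETL.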
Lessons Learned
- There are certain dates in the year when there will be large amounts of data to send to ESR (three months before many trainees rotate between jobs).
- Other sync jobs can cause TCS to be restarted, which will stop any in-progress ETLs etc.
What went well
- Teamwork - finding a temporary solution
What went wrong
- Too many other sync jobs getting in the way of a lengthy ETL
- Monitoring insufficient - we weren't able to see whether the ETL was actually running correctly. We were blind to what was happening until it was complete (some 26 hours later).
Action Items
- Raise a ticket to include retries (as one type of service resilience) for connection issues, e.g. for a configurable list of HTTP status codes.
Lessons Learned
- 401 (Unauthorized) HTTP responses are not always due to the service or profile being down/restarting.
- Slack notifications can be configured for keywords
What went well
- It was simple to rectify the issue with this particular instance.
What went wrong
- A transient connection problem caused the job to fail.
- The jobs are scheduled very close to each other; the restarted job didn't complete before the next in the chain ran.
Where we got lucky
Supporting information
...
ESR log:
2019-05-21 16:00:39.199 ERROR 1 --- [main] o.s.boot.SpringApplication : Application startup failed
java.lang.IllegalStateException: Failed to execute ApplicationRunner
...
at com.transformuk.hee.tis.esr.Application.main(Application.java:49) 1.0.27
...
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://tcs:8093/tcs/api/placements/filter": {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}; nested exception is java.io.IOException: {
"timestamp" : "2019-05-21T16:00:39.185+0000",
"status" : 401,
"error" : "Unauthorized",
"message" : "Unauthorized",
"path" : "/tcs/api/placements/filter"
}
TCS and Profile logs don't show a corresponding error or restart at the time of the 401.