Date

Authors

Joseph (Pepe) Kelly, Yafang Deng, Steven Howard

Status

In Progress

Jira Legacy
serverSystem JIRA
serverId4c843cd5-e5a9-329d-ae88-66091fcfe3c7
keyTIS21-4449

Summary

The bulk upload page is continually refreshing and showing “the server took too long to respond”.

Impact

Users cannot use bulk upload as usual.

Sometimes, when the bulk upload service stays up for a short period, users are able to upload a file. But if the service restarts while the file is being processed, it can miss responses from the other services and the file is left stalled in progress.

Non-technical Description

The bulk upload service was continually restarting and the bulk upload web page was continually refreshing. This meant that users were less able to submit and check their bulk uploads for a large part of Friday. Some users were able to submit smaller uploads, which were processed, but the failure kept recurring until the TIS team intervened.

On Thursday afternoon, 3 uploads of 1 or more spreadsheets had lots of rows that were blank apart from a hyphen in the address field. Bulk upload treated these as rows that required processing, so it produced a significant number of errors (see below). By temporarily allocating more resources to the service that processes uploads, we enabled it to cope with the additional pressure of letting users know about the number of errors.

...

Trigger

  • Three very large files were uploaded and completed with thousands of errors.

...

Detection

  • User queries:

Liam Lofthouse: Morning General - is there an issue with the bulk upload page? I seem…

(posted in TIS Support Channel / General at 21 April 2023 09:28:59)

...

In the #monitoring-ndw channel on Slack, no notifications were found for the ndw-etl-prod task:

...

Resolution

...

Manually ran the ndw-etl-prod task on ECS.
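For reference, a minimal sketch of doing the equivalent programmatically with the AWS SDK for Java v2. The task was actually started by hand; the cluster name and the use of the cluster's default capacity settings here are assumptions.

```java
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.RunTaskRequest;
import software.amazon.awssdk.services.ecs.model.RunTaskResponse;

public class RunNdwEtlTask {
    public static void main(String[] args) {
        try (EcsClient ecs = EcsClient.create()) {
            RunTaskRequest request = RunTaskRequest.builder()
                    .cluster("tis-prod")            // assumed cluster name
                    .taskDefinition("ndw-etl-prod") // task family from this incident
                    .count(1)
                    .build();

            RunTaskResponse response = ecs.runTask(request);

            // Failures carry reasons such as "Capacity is unavailable at this time...",
            // the same error later found in the CloudTrail event history.
            response.failures().forEach(f -> System.out.println(f.arn() + ": " + f.reason()));
            response.tasks().forEach(t -> System.out.println("Started: " + t.taskArn()));
        }
    }
}
```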

...

Resolution

  • Increased the JVM heap memory from 0.5 GB to 4 GB, and the reserved (container) memory for bulk upload from 1 GB to 5 GB, until the files with large numbers of errors were no longer loaded on the first bulk upload page (a quick way to verify the new limits at runtime is sketched after this list).

  • Backed up the ApplicationType records for those 3 large files.
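Because a container memory limit can silently cap the heap the JVM actually gets, a simple runtime check like the one below (illustrative only, not part of the service code) can confirm the new settings took effect.

```java
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxHeapMb = rt.maxMemory() / (1024 * 1024);       // effectively the -Xmx value
        long committedMb = rt.totalMemory() / (1024 * 1024);   // heap currently committed

        // After the change above we would expect roughly 4096 MB max heap,
        // inside a container reserving 5 GB.
        System.out.println("Max heap (MB): " + maxHeapMb);
        System.out.println("Committed heap (MB): " + committedMb);
    }
}
```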

...

Timeline

BST unless otherwise stated

  • 11 Apr 14:03 - Since then, no further notifications had been received for the overnight ndw-etl jobs.

  • 11:32 - The NDW team emailed about the failure.

  • 12:07 - James redirected the email reporting that the push of TIS data into NDW had failed.

  • 12:40 - Yafang found no logs on ndw-etl-prod from the overnight run.

  • 13:02 - Andy D found that EventBridge had triggered the task and the service considered itself healthy, but there was no record of the task on prod.

  • 14:26 - Yafang & Jay triggered the task ndw-etl-prod manually

  • 14:56 - The task ran and exited successfully on etl-prod.

  • 15:11 - James let Guy know and NDW started their jobs.

  • 13:45 - Guy confirmed there were no subsequent issues.

  • 14:36 - Several uploads with significant numbers of errors.

  • 09:28 - User reported that the page kept refreshing.

  • 09:30 - Found the service was running out of memory (OutOfMemoryError). The logs gave no indication why. The dashboard indicated that there *may* have been an issue since the previous day.

  • 09:30-13:00 - Made hotfix changes to the deployment configuration to capture additional information. Found a valid cause for the additional memory use. We modified the configuration to give the service more resource but missed additional resource constraints.

  • 13:49 - User reported that an uploaded file had stalled. We assumed this was because the service restarted while the file was being processed.

  • 14:40-15:40 - Modified and monitored the service further to check that it was stable, notifying the Teams channel at 15:55.

  • 16:15 - Got the same issue replicated on Stage after manually uploading those 3 large files (logId: 1682001412239, 1681999967423, 1681999429355 on Prod).

  • 16:30 - Modified the ApplicationType.errorJson column for the 3 uploaded large files on Stage to a single error, and confirmed Stage was resolved (a sketch of this kind of update follows the timeline).

  • 17:10 - Exported the records of the 3 large files from the ApplicationType table locally for future use.
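For illustration only, the Stage fix in the 16:30 entry amounted to replacing each oversized errorJson value with a single error. The sketch below shows one way to do that over JDBC; the key column name (logId), the JSON shape, and the connection details are assumptions, and the ids shown are the Prod logIds quoted above (the Stage ids would differ).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TrimErrorJson {
    public static void main(String[] args) throws Exception {
        // Prod logIds from the timeline; substitute the Stage ids when running there.
        String[] logIds = {"1682001412239", "1681999967423", "1681999429355"};

        // Hypothetical single-error replacement payload.
        String singleError =
                "[{\"errorMessage\":\"Errors truncated during incident TIS21-4449\"}]";

        String url = System.getenv("DB_URL");        // e.g. jdbc:mysql://stage-db/tis (assumed)
        String user = System.getenv("DB_USER");
        String password = System.getenv("DB_PASSWORD");

        try (Connection conn = DriverManager.getConnection(url, user, password);
             PreparedStatement ps = conn.prepareStatement(
                     "UPDATE ApplicationType SET errorJson = ? WHERE logId = ?")) {
            for (String logId : logIds) {
                ps.setString(1, singleError);
                ps.setString(2, logId);
                ps.executeUpdate();
            }
        }
    }
}
```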

...

Root Cause(s)

  • We expect the ndw-etl-prod job to be triggered by an AWS EventBridge rule every day at 2am UTC.

  • From the metrics, the EventBridge rule was triggered on Apr 11, but there were no logs found in CloudWatch. And from the ECSStoppedTasksEvent, we can also see that the ndw-etl-prod task was not started.

  • The CloudTrail event history shows the reason for the failure: "Capacity is unavailable at this time. Please try again later or in a different availability zone"

  • We could ask AWS why there was no capacity. The AWS Service Status indicates there was no failure or maintenance reducing expected capacity.

  • Intermittently, the bulk upload screen on TIS was constantly refreshing.

  • Admins-UI refreshes the screen when it receives errors from backend services (a 503 HTTP error from the /status API).

  • The 503 HTTP error indicated that the generic upload service was not available.

  • In the bulk upload service logs we found java.lang.OutOfMemoryError: Java heap space. When the /status API was called, the bulk upload service ran out of memory.

  • When someone looks at the status of bulk upload jobs, the errors for the visible jobs are loaded from the database, and at this time that included an exceptional amount of data (~180K rows of errors). A sketch of the kind of query change proposed to avoid this follows this list.

  • The files uploaded had lots of rows that contained just a hyphen / did not follow the template format.
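To make the last few causes concrete, here is a hedged sketch of the "retrieve fewer columns" idea picked up in TIS21-4383. All class, field, and method names are hypothetical (imports assume Spring Boot 3 / Spring Data JPA), not the actual service code; the point is that the /status listing selects a projection without errorJson, so the large error payloads never reach the heap.

```java
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Lob;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

// Hypothetical, trimmed entity for the uploads table.
@Entity
class ApplicationTypeEntity {
    @Id Long id;
    String fileName;
    String status;
    Integer errorCount;
    @Lob String errorJson;   // the column that held the exceptional volume of errors
}

// Closed projection: only these columns are selected for the /status listing.
interface UploadStatusView {
    Long getId();
    String getFileName();
    String getStatus();
    Integer getErrorCount();
    // Deliberately no getErrorJson(): the large error payload stays in the database.
}

interface ApplicationTypeRepository extends JpaRepository<ApplicationTypeEntity, Long> {
    // Spring Data derives a query selecting only the projected columns, so the
    // /status search no longer pulls every errorJson blob into the heap.
    List<UploadStatusView> findAllProjectedBy();
}
```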

...

Action Items

Action Items

Comments

Owner

Monitor CloudTrail events. If we get this error again, pick another time to restart the task.

Alert to Slack channel if runTask fails

Fix up service deployment configuration (volume mappings for logs & heap dump)

Preferred: Move the service to ECS

Don’t know if ECS would make the heap dumps available

Joseph (Pepe) Kelly / Jayanta Saha

Improve memory use: Change what columns are retrieved from the database for the /status search.

Yafang Deng

Jira Legacy
serverSystem JIRA
serverId4c843cd5-e5a9-329d-ae88-66091fcfe3c7
keyTIS21-4383

Change config to use a wider choice of availability zones.

Other things we won't do, at least for now, given the assumed low probability of recurrence:

  • Automated retries

  • A separate job to check that the ETL has run for each environment

...

Analyse the data uploaded.

Attached file: Analysis of rows in files uploaded via bulk upload.pdf

This would be to inform setting limits on the number of rows that are uploaded.

Steven Howard or James Harris / Stan Ewenike (Unlicensed)

Get feedback from Local Office about what happened

James Harris

Lessons Learned

  • We noticed the 3 large files at first sight, but did not recognise them as the root cause at the very beginning, because the data received in the API response does not contain the error messages.

However, the backend service does load them from the DB.

  • If something looks unusual (too big!) on the UI, it's probably the cause.