2023-01-26 Stalled processing of bulk upload
Date | Jan 26, 2023 |
Authors | @Yafang Deng @Joseph (Pepe) Kelly @catherine.odukale (Unlicensed) @Fiaz Adam (Deactivated) @Rob Pink |
Status | Done |
Summary | |
Impact | One file was stuck showing “IN PROGRESS” and bulk upload was no longer picking up the following files. Those files were delayed until we changed the status of the blocking file, after which they were processed. |
Non-technical Description
When files are submitted on the bulk upload page, they are uploaded to storage and their jobs are recorded with the status “PENDING”. The bulk upload service then picks up the first file and sets its status to “IN PROGRESS”, leaving all following files “PENDING”. Only when the current file is marked “COMPLETED” are the following “PENDING” files processed.
In this incident, a file sent via bulk upload (the mechanism used when many records are uploaded at once) was never being completed, so the files queued behind it were not processed. This was likely due to a timeout, caused by the limits of the connection between the Bulk Upload service and a system named TCS.
Trigger
A Person Update template file with 1180 records was uploaded.
On prod, the id (epoch time) of the file is 1674667043653 and the record id in genericupload.ApplicationType is 28781.
Detection
User’s query on Teams
Resolution
Get all the person ids from the spreadsheet, query the corresponding emails from Metabase, then compare the emails from the DB with the emails in the spreadsheet to find discrepancies (a sketch of this comparison follows the resolution steps).
Inform the user of the discrepancies and ask for a manual update.
Set the status of the stalled file from “IN_PROGRESS“ to “COMPLETED“.
Restart the generic upload Docker container on Prod Green and confirm, via Admins-UI or the Docker logs, that the service is picking up the following pending files.
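A minimal sketch of the comparison step, assuming the spreadsheet and the Metabase export are both available as plain `personId,email` CSV files; the file names and CSV layout are illustrative, not the actual formats used:

```java
// Hypothetical one-off check used during resolution: compare the emails in the
// uploaded spreadsheet with the emails currently stored for the same person ids
// (e.g. exported from Metabase as CSV). File names and CSV layout are assumptions.
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class EmailDiscrepancyCheck {

    // Reads "personId,email" lines into a map; assumes no header row and unique ids.
    static Map<String, String> loadCsv(Path file) throws Exception {
        try (Stream<String> lines = Files.lines(file)) {
            return lines
                    .map(line -> line.split(",", 2))
                    .collect(Collectors.toMap(cols -> cols[0].trim(),
                                              cols -> cols[1].trim().toLowerCase()));
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> spreadsheet = loadCsv(Path.of("person-update-template.csv")); // bulk upload file
        Map<String, String> database = loadCsv(Path.of("metabase-export.csv"));           // current DB emails

        spreadsheet.forEach((personId, expectedEmail) -> {
            String actualEmail = database.get(personId);
            if (!expectedEmail.equals(actualEmail)) {
                System.out.printf("Discrepancy for person %s: spreadsheet=%s, db=%s%n",
                        personId, expectedEmail, actualEmail);
            }
        });
    }
}
```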
Timeline
BST unless otherwise stated
Jan 25, 2023 17:17 a user uploaded a Person Update template file with 1180 records to update doctor emails.
Jan 26, 2023 09:35 users reported that the bulk upload job was still showing as in progress and all the following files were stuck.
Jan 26, 2023 10:00 the team discussed the user queries on standup.
Jan 26, 2023 11:10 most of the records in the stalled file were found to be already updated on Prod.
Jan 26, 2023 12:34 a PR was merged to retry the stalled file. schema_version got the newly installed version and jobStartTime in ApplicationType was updated for the job, but no logs were found on TCS.
Jan 26, 2023 14:30ish the status of the in-progress job was manually updated to “PENDING”, but the service skipped over that spreadsheet.
Jan 26, 2023 15:00ish TCS CloudWatch logs were re-checked by Pepe & Yafang together and they agreed the spreadsheet had already been processed.
Jan 26, 2023 16:00ish as expected, only one email was found not to be updated for that spreadsheet.
Jan 26, 2023 16:16 the user was informed of the status of the job.
Jan 26, 2023 16:20 the status of the job was manually updated from “IN_PROGRESS” to “COMPLETED” and a Docker restart was done. The Generic Upload service resumed processing.
Root Cause(s)
A file was being processed and hit an issue that stopped it from completing. The other files remained pending because the service queues them behind the current file.
The logs show that the first file was started but not finished, yet the TCS logs indicated the first and last people in the file were saved, which took a number of minutes. We later confirmed that all expected records had been updated. The problem occurred between the data being received by TCS and Bulk Upload generating the results.
There was likely a timeout caused by the time taken to process all the records in the file and the connection limits between Bulk Upload & TCS.
For Person Update, the Generic Upload service assembles all the DTOs together and sends them to TCS all at once.
TCS then validates the gmc/gdc numbers, person details, contact details, right to work, roles, trainerApprovals, etc. on the existing person entities against the data from the spreadsheet. This takes a very long time when many records are uploaded in one file.
After validation, TCS saves the data into several DB tables one record at a time, which is also time-consuming.
Only when all of this processing is done does TCS send a response back to the Generic Upload service, so the request between Generic Upload and TCS may well have timed out (see the sketch below).
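A minimal sketch of the single-request pattern described above, not the actual Generic Upload code; the endpoint URL and timeout values are assumptions for illustration. It shows how one long bulk call can exceed a client-side request timeout while TCS keeps working:

```java
// Minimal sketch (not the actual Generic Upload code) of the single bulk request
// described above: every person DTO is sent to TCS in one POST and the client waits
// for a single response. The endpoint URL and timeout values are illustrative.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BulkPersonUpdateClient {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    public String sendAllAtOnce(String personDtosJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://tcs.example/api/bulk-person-update"))
                .header("Content-Type", "application/json")
                // If TCS needs several minutes to validate and save 1180 records,
                // this request timeout (or an intermediate infrastructure limit) fires first.
                .timeout(Duration.ofMinutes(1))
                .POST(HttpRequest.BodyPublishers.ofString(personDtosJson))
                .build();

        // Throws java.net.http.HttpTimeoutException if no response arrives in time,
        // leaving the job stuck in "IN_PROGRESS" even though TCS may finish the work later.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```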
When the service was initially developed there was very little stateful validation requiring lookups. Validation has since been added that does not take minutes for “normal”/smaller numbers of updates, but adds up for large files.
Some of the newer infrastructure we now use imposes a hard limit on the maximum time a request can stay open without “keep-alive” packets.
Action Items
Action Items | Comments | Owner |
---|---|---|
Add a new endpoint in TCS for single-person validation & update, and modify the Person Update bulk upload to send persons one by one (see the sketch after this table). | This should be taken into account when we re-design the bulk upload service. | |
Find the limit of Person Update before we hit the timeouts and apply that limit in the Generic Upload service. Add a message on the UI to inform users not to exceed the limit. | | @catherine.odukale (Unlicensed) |
Generic Upload sends “keep-alive” packets to keep the connection open. | | @Yafang Deng |
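A minimal sketch of the first action item; the /api/person-update endpoint path and client details are hypothetical. Sending one short request per person keeps each call well under any connection or gateway time limit, at the cost of more round trips:

```java
// Sketch of the proposed change (the /api/person-update endpoint name is hypothetical):
// instead of one large bulk POST, Generic Upload would validate/update each person with
// a separate short request, so no single call approaches the infrastructure time limit.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

public class PerPersonUpdateClient {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    public void sendOneByOne(List<String> personDtoJsons) throws Exception {
        for (String dto : personDtoJsons) {
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://tcs.example/api/person-update"))
                    .header("Content-Type", "application/json")
                    .timeout(Duration.ofSeconds(30)) // each per-person call is short
                    .POST(HttpRequest.BodyPublishers.ofString(dto))
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Record per-person success/failure so the bulk job can still complete with partial results.
            System.out.printf("person update -> HTTP %d%n", response.statusCode());
        }
    }
}
```

A per-person endpoint would also make it easier to report which individual rows failed, instead of the whole file stalling in “IN PROGRESS”.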
Lessons Learned
When we find logs showing the data has already been updated, it is worth doing a quick comparison of the current data against the data from the spreadsheet, so the Generic Upload service can be resumed more quickly.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213