Date

Authors

Yafang Deng, Joseph (Pepe) Kelly

Status

Documenting

Summary

Jira ticket: TIS21-4127

Impact

One file was stuck showing “IN PROGRESS”, and bulk upload stopped picking up the files queued after it.

Non-technical Description

When files are uploaded on the bulk upload page, each file is uploaded to S3 and a record is inserted into the DB table genericupload.ApplicationType. The generic upload service then picks up the uploaded files one by one, in order. The next “PENDING” file is processed only once the current file has been marked “COMPLETED”.
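The blocking behaviour described above can be pictured with a minimal sketch. This is not the service's actual code: the function name, the dictionary fields and the rule "nothing is picked while any file is IN_PROGRESS" are assumptions made for illustration only.

```python
def next_file_to_process(uploads):
    """Return the next PENDING upload, or None while a file is still IN_PROGRESS.

    `uploads` is a list of dicts with hypothetical keys "id" and "status".
    """
    ordered = sorted(uploads, key=lambda u: u["id"])  # oldest upload first
    # Assumption: a file left IN_PROGRESS blocks everything queued behind it.
    if any(u["status"] == "IN_PROGRESS" for u in ordered):
        return None
    for upload in ordered:
        if upload["status"] == "PENDING":
            return upload
    return None


# Illustrative queue: one stalled file followed by two pending ones.
queue = [
    {"id": 1, "status": "IN_PROGRESS"},
    {"id": 2, "status": "PENDING"},
    {"id": 3, "status": "PENDING"},
]
blocked = next_file_to_process(queue)    # None: the stalled file blocks the queue
queue[0]["status"] = "COMPLETED"
resumed = next_file_to_process(queue)    # the file with id 2
```

Under these assumptions, one file that never reaches “COMPLETED” is enough to stop every later upload from being processed.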

...

Trigger

  • A person update template file with 1180 records was uploaded.

  • On prod, the id (epoch time in milliseconds) of the file is 1674667043653 and the record id in genericupload.ApplicationType is 28781.
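Because the file id is an epoch timestamp, it can be converted back to the upload time when cross-checking against logs. A small sketch, assuming the id is in milliseconds since the Unix epoch (the helper name is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(epoch_ms):
    """Convert an epoch-milliseconds id to a timezone-aware UTC datetime."""
    seconds, millis = divmod(epoch_ms, 1000)
    # Integer arithmetic avoids float rounding in the millisecond part.
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


upload_time = epoch_ms_to_utc(1674667043653)
print(upload_time.isoformat())  # 2023-01-25T17:17:23.653000+00:00
```

The result is in UTC and lines up with the 17:17 upload entry in the Timeline.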

...

Detection

  • User’s query on Teams

  • bulk upload file list on Admins-UI

  • docker logs on Prod Green

...

Resolution

  • Get all the person ids from the spreadsheet, query the emails for those person ids from Metabase, then compare the DB emails with the emails in the spreadsheet to find the discrepancies.

  • Inform the user of the discrepancies and ask for a manual update.

  • Set the status of the stalled file from “IN_PROGRESS” to “COMPLETED”.

  • Restart the generic upload docker container on Prod Green and confirm, on Admins-UI or in the docker logs, that the service is picking up the following pending files.
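The email comparison in the first step can be scripted once both data sets are exported. The sketch below assumes each side has been loaded into a dictionary keyed by person id; the function name, the sample ids and addresses, and the case-insensitive comparison are all illustrative assumptions, not what was actually run during the incident.

```python
def find_email_discrepancies(spreadsheet_emails, db_emails):
    """Return {person_id: (expected_email, actual_email)} for every mismatch.

    Both arguments map person id -> email. A person missing from the DB
    export is reported with an actual value of None.
    """
    discrepancies = {}
    for person_id, expected in spreadsheet_emails.items():
        actual = db_emails.get(person_id)
        # Assumption: emails are compared ignoring case and surrounding spaces.
        if actual is None or actual.strip().lower() != expected.strip().lower():
            discrepancies[person_id] = (expected, actual)
    return discrepancies


# Made-up sample data, for illustration only.
from_spreadsheet = {101: "a.new@example.org", 102: "b.new@example.org"}
from_db = {101: "A.New@example.org", 102: "b.old@example.org"}
mismatches = find_email_discrepancies(from_spreadsheet, from_db)
# mismatches == {102: ("b.new@example.org", "b.old@example.org")}
```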

...

Timeline

BST unless otherwise stated

  • 17:17 a user uploaded a person update template file with 1180 records to update doctor emails.

  • 09:35 users reported that the bulk upload job was still showing as in progress and all the following files were stuck.

  • 10:00 the team discussed the user queries at standup.

  • 11:10 most of the records in the stalled file were found to be already updated on Prod.

  • 12:34 a PR was merged to retry the stalled file. schema_version showed the newly installed version and jobStartTime in ApplicationType was updated for the job, but no logs were found on TCS.

  • 14:30ish the status of the in-progress job was manually updated to “PENDING”, but the service skipped over that spreadsheet.

  • 15:00ish the TCS CloudWatch logs were re-checked by Pepe & Yafang together, and they agreed the spreadsheet had already been processed.

  • 16:00ish as expected, only one email was found not to have been updated for that spreadsheet.

  • 16:16 the user was informed of the status of the job.

  • 16:20 the status of the job was manually updated from “IN_PROGRESS” to “COMPLETED”, and docker was then restarted. The Generic Upload service resumed.

...

Root Cause(s)

  • For person update, the Generic Upload service assembles all the DTOs and sends them to TCS all at once.

  • TCS then validates the gmc/gdc, person details, contact details, right to work, roles, trainerApprovals, etc. of each existing person entity against the data from the spreadsheet. This takes a very long time when many records are uploaded in one file.

  • After validation, TCS saves the data into several DB tables one record at a time, which is also time-consuming.

  • TCS sends a response back to the Generic Upload service only when all the processing is done, so there may have been a timeout between TCS and Generic Upload.
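The suspected failure mode above (the caller gives up waiting while the callee keeps working) can be reproduced in a toy simulation. This is only a sketch of the hypothesis: a thread pool stands in for the call from Generic Upload to TCS, and every name, delay and timeout value is made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def process_all(records, saved, per_record_seconds):
    """Stand-in for TCS: save every record before sending any response."""
    for record in records:
        time.sleep(per_record_seconds)  # simulated validation + save
        saved.append(record)
    return len(saved)


def upload(records, timeout_seconds, per_record_seconds=0.01):
    """Stand-in for Generic Upload: one call for the whole file, with a timeout."""
    saved = []
    status = "IN_PROGRESS"
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(process_all, records, saved, per_record_seconds)
        try:
            future.result(timeout=timeout_seconds)
            status = "COMPLETED"
        except FutureTimeout:
            pass  # the caller stops waiting, so the status is never updated
    # Leaving the with-block waits for the worker, which finishes regardless.
    return status, len(saved)


large_file = upload(list(range(20)), timeout_seconds=0.05)
small_file = upload(list(range(3)), timeout_seconds=2)
print(large_file)  # ('IN_PROGRESS', 20)
print(small_file)  # ('COMPLETED', 3)
```

In the large-file case every record ends up saved while the job status stays “IN_PROGRESS”, which matches what was observed: the data was largely updated on Prod but the job never completed.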

...

Action Items

Action Items

Comments

Owner

Lessons Learned

...

  • When logs show that the data has already been updated, it’s worth doing a quick comparison of the current data against the data from the spreadsheet, so that the Generic Upload service can be resumed more quickly.