Non-technical Description
Trigger
Detection
User queries:
Teams Support channel
Resolution
Restarted service
After testing the file in the stage environment, found that it was .
Tested the file locally; processing took more than 2 hours.
Timeline
BST unless otherwise stated
13:55 - File uploaded and processing started.
13:55-15:24 - Other users uploaded files while the original file was still being processed.
15:24 - Error logged
15:25 - User report on Teams
15:24-15:40 - Service monitored for signs that data was still being processed; once it appeared that it wasn’t, the service was restarted.
15:53 - Service processes queued files
17:29 - Admin user uploaded the same Placement Create file again, and it was processed successfully.
Root Cause(s)
When admin users raised the query, the job had already been running for about 1.5 hours (12:55:51 UTC to 14:25 UTC).
We thought the job had stalled, but it had not. The job was still processing until the generic upload service was restarted.
The image below shows the record for the 1702nd row of the spreadsheet, which has 1788 rows in total.
Action Items
Action Items | Comments | Owner |
---|---|---|
Lessons Learned
Check the logs to confirm whether a job is really stalled before restarting the service… pair up whenever possible.
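The log check above could be sketched roughly as follows. This is only an illustration, not the service's actual code: the function names `is_stalled` and `progress_fraction` and the 10-minute silence threshold are assumptions. The row counts come from the Root Cause(s) section.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(last_progress_at: datetime, now: datetime,
               max_silence: timedelta = timedelta(minutes=10)) -> bool:
    """Return True if no progress has been logged within max_silence.

    The 10-minute default is an assumed threshold, not a measured one.
    """
    return now - last_progress_at > max_silence


def progress_fraction(rows_done: int, rows_total: int) -> float:
    """Return the fraction of spreadsheet rows processed so far."""
    if rows_total <= 0:
        raise ValueError("rows_total must be positive")
    return rows_done / rows_total


if __name__ == "__main__":
    # Hypothetical timestamps: last progress log line 2 minutes before "now".
    now = datetime(2000, 1, 1, 14, 30, tzinfo=timezone.utc)
    last_log = now - timedelta(minutes=2)
    print("stalled:", is_stalled(last_log, now))

    # Row counts from this incident: row 1702 of 1788.
    print(f"progress: {progress_fraction(1702, 1788):.1%}")
```

A check like this only helps if the job logs progress regularly (for example, once per processed row or batch). A slow job that is still logging new rows would not be reported as stalled.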