2024-08-20 No notifications exported to ESR since 10th
Date | Aug 20, 2024 |
Authors | @Rob Pink @Joseph (Pepe) Kelly @Yafang Deng |
Status | Documenting |
Summary | The absence of notifications was noticed, and investigation found that none had been sent since 10 August. |
Impact | https://hee-tis.atlassian.net/browse/TIS21-6411 Trusts did not receive notifications as early as they usually would. All notifications from this period were sent on the 20th Aug. |
Non-technical Description
Notifications are usually sent to ESR every day, describing changes and updates to the “now & next” people in positions. The service did not generate the files for 11th - 19th Aug because it was unable to complete the necessary queries and updates to the database within the allocated transaction time. Throughout this time, Applicant records were sent and confirmations received as normal.
Any notifications which should have gone out during this time and were still valid on the 20th were sent. For example, the “now & next” (TM Pepe) notifications for job changeovers between 11th and 19th were sent on the 20th.
Trigger
Although we have not reproduced the failure, we are confident that the quantity of notifications, combined with other operational use, triggered the combination of factors that led to this unplanned delay.
Detection
We noticed a lack of “Confirmation” files from ESR.
Resolution
Upsized the cluster resources to enable the exporting jobs to run.
Timeline
Jul 31, 2024 Shovel set up, but it was persistent rather than temporary.
Aug 10, 2024 14:10 Last successful notification file generation.
Aug 11, 2024 14:00 Repeated failures prevent notification file generation.
Aug 20, 2024 We noticed that notification confirmations were delayed beyond ‘normal’ delays.
Aug 20, 2024 11:30 Resized the cluster, which took 12 mins.
Aug 20, 2024 11:50 Deleted the shovel, which was sending errors to a temporary queue.
Aug 20, 2024 12:10 (approx.) Indication on Metabase that the notification files had been created.
Aug 20, 2024 15:27 Received confirmation files.
5 Whys (or other analysis of Root Cause)
We didn’t receive DCC confirmation files because we didn’t send any files for ESR to confirm receipt of.
The attempts to build notification files failed (errors were logged as warnings but not reported via Sentry).
Database transactions timed out.
The database didn’t have the resources to complete the transaction within the time limit.
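The timeline above shows roughly ten days between the last successful file generation and detection. A freshness check on the last successful export is one way such a gap could be surfaced sooner. The following is a minimal sketch only: the function name and the 36-hour threshold are assumptions for illustration and are not taken from the service, and the timestamps are those in the timeline.

```python
from datetime import datetime, timedelta


def notification_export_is_stale(
    last_success: datetime,
    now: datetime,
    max_gap: timedelta = timedelta(hours=36),  # assumed threshold for a daily export
) -> bool:
    """Return True when the last successful export is older than max_gap."""
    return now - last_success > max_gap


# Last successful notification file generation, from the timeline.
last_success = datetime(2024, 8, 10, 14, 10)

# First failed run: still inside the allowed gap, so no alert yet.
print(notification_export_is_stale(last_success, datetime(2024, 8, 11, 14, 0)))

# A day later the gap exceeds the threshold, so the check would flag it.
print(notification_export_is_stale(last_success, datetime(2024, 8, 12, 14, 0)))
```

A threshold longer than 24 hours tolerates one late run of a daily job while still flagging a second consecutive miss.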
Action Items
Action Items | Owner | Comments |
---|---|---|
Maintain increased (maximum) tier. Offset additional cost by scheduling cluster availability and size | @Joseph (Pepe) Kelly | |
Change data retention to improve performance | | Not right now. Probably in the future. |
See also:
Lessons Learned
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213