Date

05 Feb 2024

Authors

Status

Done

Summary

Jira: TIS21-5681

We received a large number of Sentry errors, which indicated that recommendation changes made over the weekend would not have been reflected in the Reval search.

Impact

No user impact determined.

Non-technical Description

The data storage job (ETL) that feeds information for use failed, which meant that not all data was processed. As a result, users were getting incomplete data. Some changes related to Post Funding had been released to the pre-production environment for testing. The “Continuous Deployment” approach meant that the job that creates information in the Data Warehouse was updated before the source of the new information had been created. This meant that it was unable to copy any Post Funding information across.

A user reported the issue in a report that uses the Post Funding information. The data was refreshed, updating the report.

We have now released the latest versions of the source and job, meaning that as Post Funding information is added, it will be available in the NDW and available for new or updated reports.

Trigger

Automated deployment to prod

Detection

Slack

Resolution

Re-ran ETL with previous version
Set scheduled run to use previous version

We have a job, running up to once per minute, that follows changes in the revalidation database. It keeps track of the most recent change it processed so that it knows where to pick up next time. We received an alert that the reference to the latest recommendation change couldn’t be used because it was no longer available. We were fortunate that changes to doctors continued.

We moved the old reference and made a change to the code to enable the job to restart tracking changes.
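
The tracking-and-restart behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the job's actual code: `open_stream`, `load_token`, `save_token`, `handle` and the `ResumeTokenLost` exception are all hypothetical stand-ins for the change stream client and the stored reference. It also assumes `open_stream` raises as soon as it is called with an unusable reference.

```python
from typing import Any, Callable, Iterable, Optional


class ResumeTokenLost(Exception):
    """Hypothetical error: the stored reference is no longer in the change log."""


def process_changes(
    open_stream: Callable[[Optional[Any]], Iterable[dict]],
    load_token: Callable[[], Optional[Any]],
    save_token: Callable[[Any], None],
    handle: Callable[[dict], None],
) -> int:
    """Handle pending changes, resuming after the last stored reference.

    Falls back to a fresh stream when the stored reference is unusable.
    Returns the number of changes handled in this run.
    """
    token = load_token()
    try:
        stream = open_stream(token)
    except ResumeTokenLost:
        # The reference was cleaned up, so start tracking again without one.
        stream = open_stream(None)

    handled = 0
    for change in stream:
        handle(change)
        # Persist the position only after the change was handled successfully.
        save_token(change["_id"])
        handled += 1
    return handled
```

Restarting without a reference means any changes made while the reference was unusable are skipped, so a fallback like this would normally be paired with an alert or a re-sync.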

Between Saturday morning and Monday morning, some changes in the Reval database would not have been reflected in the Reval UI. However, there were no changes during this period, so nothing was affected.

Trigger

There were about 2,800 Sentry errors. It doesn’t look like any recommendations were made during this time, so we were fortunate that nothing was affected.
Combination of 3 hour window having elapsed and transaction log rolling over

...

Detection

...

Resolution

...

Timeline

All times in GMT unless indicated

25 Jan 2024 - Change to the preprod environment that required a change in the ETL
26 Jan 2024 01:07 - Matching change to the ETL is made
26 Jan 2024 01:25 - Pull Request merged
26 Jan 2024 02:00 & 02:30 - The updated ETL is run for NHS E & NIMDTA respectively
26 Jan 2024 05:39 - The previous versions are run on production
29 Jan 2024 - Changes released to prod environments and updated the scheduled event to use latest task definitions
02 Feb 2024 18:07 - Last recommendation submitted.
03 Feb 2024 05:00-05:29 - Backup window
03 Feb 2024 05:29 - Earliest Slack message identifying an issue with the production CDC Lambda
05 Feb 2024 ~06:00 - Renamed the attribute in the database collection. This caused other issues, which were resolved by applying a hotfix to generate a new reference until a more robust solution is implemented

Root Cause(s)

We got a Slack message that the ETL was running for NHS E & NIMDTA, but there was no completion message
The ETL was retrying the step that creates Post Funding
The step was failing because the SQL included a field which didn’t exist
The ETL relied on some database changes which hadn’t been released to production sites yet
The ETL workflow automatically deploys unless it is cancelled in a 5 minute window
The workflow can’t have an approval step while the source is private and part of the current subscription
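
The repeated retrying of a failing step, and the "max retries" idea raised in the action items, can be illustrated with a small sketch. This is not the ETL's code (the action items suggest it is built on Spring Batch); `run_step`, `StepFailed`, `max_attempts` and `delay_seconds` are hypothetical names used only to show how bounding the attempts makes a step fail fast instead of retrying indefinitely.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class StepFailed(Exception):
    """Raised when a step still fails after the allowed number of attempts."""


def run_step(step: Callable[[], T], max_attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Run a step, retrying on any error up to max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as error:
            last_error = error
            if attempt < max_attempts:
                # Wait before the next attempt; no wait after the final one.
                time.sleep(delay_seconds)

    # Give up and surface the underlying error so the job can report failure.
    raise StepFailed(f"step failed after {max_attempts} attempts") from last_error
```

With a limit like this, a step whose SQL references a missing field would stop after a few attempts and report a failure, rather than leaving the job running with no completion message.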

Action Items

...

Action Items

...

Owner

...

~~Add max retries for all steps to reduce the scope of the failure~~

...

This does have value and we may look into it in the future, depending on whether we want to stick with Spring Batch

...

Spike: explore options for enabling GHA workflow approvals:

Use an Enterprise subscription
Make the repository public

...

Sentry errors were caused by a reference which could not be found
The reference was to the change stream for the last change in the recommendation collection, which was presumably cleaned up
The change stream was cleaned up*
The change stream is configured to hold references for at least 3 hours (the default)
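
As a worked check of the retention window against the timeline, the sketch below compares the gap between the last recommendation (02 Feb 2024 18:07) and the earliest Slack message (03 Feb 2024 05:29) with the 3 hour default. The function name `reference_at_risk` and the `log_rolled_over` flag are hypothetical. They simply encode the trigger described in this report: the window having elapsed combined with the log rolling over.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=3)  # default retention stated in this report


def reference_at_risk(
    last_change: datetime,
    now: datetime,
    log_rolled_over: bool,
    retention: timedelta = RETENTION,
) -> bool:
    """True when the stored reference may already have been cleaned up.

    Both conditions must hold: the retention window has elapsed since the
    last change, and the log has rolled over.
    """
    return (now - last_change) > retention and log_rolled_over


last_change = datetime(2024, 2, 2, 18, 7)  # last recommendation submitted (GMT)
first_alert = datetime(2024, 2, 3, 5, 29)  # earliest Slack message (GMT)

print(first_alert - last_change)  # 11:22:00
print(reference_at_risk(last_change, first_alert, log_rolled_over=True))  # True
```

The gap of 11 hours 22 minutes is well beyond the 3 hour window, so the reference was only safe for as long as the log had not rolled over.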

...

Action Items

Action Items

Owner

Comments

Investigating why the position in capped collection was deleted (see errors on weekend):

When were the last changes to recommendation records before Sat 3rd Feb?
The previous change was on Friday evening
How is the Document DB change stream configured? Is it a 7 day window?
The change stream retention period is the default 3 hours, but entries are only removed once the log is full, at an AWS-configured 51,200 MB. It is possible that this represents the first time the threshold was crossed without additional writes since the cluster was last modified. Increasing the retention period will impact storage and performance.

Joseph (Pepe) Kelly

Done

Increase the change stream retention period.

Jira: TIS21-5713

~~Make the resume token check work if there isn’t one…~~

Upgrade and
1. Use a time based check
2. Change the Lambda to use Change Stream events directly
3. Look at how to integrate the change stream directly with SQS or another asynchronous messaging service
Check & modify DocumentDB resume token to “upsert” a new value.

Not right now. We can review this if the database is upgraded as we expect to do soon.

...
