
Date

Authors

Reuben Roberts, Andy Dingley

Status

Done

Summary

https://hee-tis.atlassian.net/browse/TIS21-2191

Impact

NDW downstream processes compromised, including Hicom Leave Manager data transfer


Non-technical Description

The overnight jobs failed. (Query: TIS Prod was not available at the time they ran?)


...

job to transfer data to the NDW for the TIS production, staging and NIMDTA environments failed. This in turn triggered the failure of the NDW ETL process, and consequently the data transfer to Hicom Leave Manager. Due to a further oversight, the same issue recurred the following night. In both instances, the TIS and NDW jobs were rerun, but Hicom only imports data once a day, so this system would have been left with data two days stale. The following graphic indicates the downstream effects, particularly the ‘Hicom Leave Manager Server’ swimlane (image: ETLs on Swim-lane).

...

Trigger

  • Breaking change in a third-party component, compounded by the subsequent misunderstanding that rerunning the initial job had fixed the underlying problem.

...

Detection

  • Slack monitoring alert.

...

Resolution

  • Reran the NDW-ETL processes using the last known-good version.

  • Investigation into the root cause discovered the breaking change in the third-party image builder for the component.

  • Pending the roll-out of a patch, upgrading the TIS-NDW-ETL from Java 8 to Java 11 resolved the issue.

...

Timeline

Night 1 (2021-09-29):

  • 03:00 - 03:40 BST #monitoring-ndw Slack channel alert that the first of three overnight jobs failed.

  • 07:53 BST Those jobs were restarted.

  • 09:11 BST All jobs had completed successfully.

  • 09:13 BST Pavel at NDW notified.

  • 12:00 BST Pavel indicates that the NDW ETL has been run successfully.

Night 2 (2021-09-30):

  • 03:00 - 03:40 BST #monitoring-ndw Slack channel alert that the first of three overnight jobs failed (unfortunately the broken task images were still ‘live’ for the services).

  • 06:30 BST These jobs were restarted.

  • 07:30 BST All jobs had completed successfully.

  • 07:40 BST Pavel at NDW notified.

  • 09:06 BST The revised component was deployed successfully and tested on staging.

  • 11:04 BST Pavel indicates that the NDW ETL has been run successfully.

Root Cause(s)

  • A new release of the TIS-NDW-ETL component was deployed the previous afternoon.

  • Two assumptions triggered the subsequent failure of the ETL jobs:

    • That merging the pull request would only deploy the revised component to the staging environment

    • That no breaking changes in third-party components would render the Docker images invalid

  • In fact, the new version of the component is deployed to all three environments (staging, production and NIMDTA). In general this should be acceptable, since the job is only run at 2am, so there should be scope to test the new deployment on staging by force-running the task, with time to roll back if necessary before the nightly scheduled run.

  • A breaking change to the Paketo image builder rendered the new component images invalid, throwing the following error on startup:
    2021-09-29T02:40:40.620Z # JRE version: (8.0_302-b08) (build )
    2021-09-29T02:40:40.621Z # Java VM: OpenJDK 64-Bit Server VM (25.302-b08 mixed mode linux-amd64 compressed oops)
    2021-09-29T02:40:40.621Z # Problematic frame:
    2021-09-29T02:40:40.623Z # V [libjvm.so+0xb2a2f7] VirtualMemoryTracker::add_reserved_region(unsigned char*, unsigned long, NativeCallStack const&, MemoryType, bool)+0x77
    2021-09-29T02:40:40.623Z #
    2021-09-29T02:40:40.623Z # Core dump written. Default location: /workspace/core or core.1
    The fix is documented here: https://github.com/paketo-buildpacks/libjvm/pull/105

  • Two workarounds were explored: (a) temporarily pinning the specific (older) version of the builder, or (b) migrating the component to Java 11.
    (a) Amending the pom.xml file as per:
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <configuration>
        <image>
          <builder>gcr.io/paketo-buildpacks/builder:0.1.174-base</builder>
        </image>
      </configuration>
    </plugin>

    was successful, but would need to be reverted once a new working version of the 3rd party component was released.
    (b) With concomitant version changes for Groovy etc., this was successful and was the preferred method. It was implemented, and a revised, functional image of the component was deployed at 08:30 BST on 2021-09-30 (a sketch of the corresponding pom.xml changes is given below).
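
For illustration, a minimal sketch of what the option (b) change might look like in the component's pom.xml. This is an assumption-laden example rather than the exact diff applied to TIS-NDW-ETL: java.version is the standard Spring Boot property for the Java compile target, and the Groovy property and version shown are hypothetical.

    <properties>
      <!-- Standard Spring Boot property: raise the Java target from 1.8 to 11 -->
      <java.version>11</java.version>
      <!-- Hypothetical: Groovy (and similar) dependencies would also need Java 11 compatible versions -->
      <groovy.version>3.0.9</groovy.version>
    </properties>

    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <!-- No <image><builder> override: with a Java 11 runtime the default Paketo builder no longer triggers the Java 8 startup crash -->
    </plugin>

Rebuilding the image (e.g. with mvn spring-boot:build-image) would then produce a Java 11 runtime layer, avoiding the Java 8 startup crash introduced by the newer builder.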

...

Action Items

  • Rebuild tis-ndw-etl using Java 11. Owner: Reuben Roberts. Status: Done.

  • Possibly update component documentation to flag up the deployment process (to mitigate against future assumptions causing this sort of issue). Owner: Reuben Roberts. Status: Done.

  • Review previous NDW-ETL failures to highlight in particular partial runs, since these cause most disruption in downstream systems (which interpret missing data as having been deleted). Owner: Reuben Roberts. Status: Done; the resulting summary is as follows:

Here is a quick summary of the NDW ETL failures. This year, we have had 5 failures (if you count the latest one as a single failure spread over two days). Last year there were 6. Of the failures this year, only one appears to have been a partial failure: 2021-03-10 The NDW ETL failed and didn't recover for PROD & STAGE.

This seems to have been triggered by an outage on the NDW side.

I guess this suggests a few things to me:

Failures of the existing NDW ETL are not particularly rare, but partial failures are.

The incidence of failures has not dropped much since last year, so the various actions/interventions in response to the failures have not (yet) had much impact. Possibly there is simply quite a diverse range of situations that can cause failure.

Mitigating against complete failures, e.g. by better notifications to downstream parties and/or automated retrying of the jobs, would be likely to see active use, but as the 'cost' of data that is one day stale is not so high, the benefit might be marginal.

Ensuring that the job either completes successfully or fails entirely (or prevents downstream processes from reading a partially populated database, e.g. by removing the credentials their process uses to connect to the database at the start of the job and recreating them only on its successful completion) might be more valuable, even though it covers a less likely scenario.

  • Further planning / ticketing work to mitigate against this in future.

Questions exist over NDW capacity to implement their end of a revised queue / stream of changes that we send, instead of the existing monolithic process. However, TIS could modernise our side with a new-to-old interface of sorts, as per the ESR BiDi project.

...

Lessons Learned

...

  • Assumptions as to the CI/CD pathway for a specific component can be problematic.

  • It would be useful to flag up non-standard deployment paths in component documentation.

  • As Hicom is an external company, there is potentially less flexibility in rescheduling jobs, and any negotiations would need to occur within a contractual framework. This also underlines the need for a consistent, reliable service from the TIS side.