
Date

Authors

Reuben Roberts, Andy Dingley

Status

In progress

Summary

https://hee-tis.atlassian.net/browse/TIS21-2191

Impact

NDW downstream processes compromised, including Hicom Leave Manager data transfer

Non-technical Description

The overnight job to transfer data to the NDW for the TIS production, staging and NIMDTA environments failed.


Trigger

  • Breaking change in third-party component.


Detection

  • Slack monitoring alert.


Resolution

  • Rolled back the TIS-NDW-ETL instances to the last known-good version.

  • Investigation into the root cause uncovered a breaking change in the third-party image builder used for the component.

  • Pending the roll-out of a patch to the builder, upgrading TIS-NDW-ETL from Java 8 to Java 11 eliminated the issue.


Timeline

  • 03:00 - 03:40 BST The #monitoring-ndw Slack channel alerted that the first of three overnight jobs had failed.

  • 07:53 BST Those jobs were restarted.

  • 09:11 BST All jobs had completed successfully.

Root Cause(s)

  • A new release of the TIS-NDW-ETL component was deployed the previous afternoon.

  • Two assumptions triggered the subsequent failure of the ETL jobs:

    • That merging the pull request would only deploy the revised component to the staging environment

    • That no breaking changes in third-party components would render the Docker images invalid

  • In fact, merging the pull request deploys the new version of the component to all three environments (staging, production and NIMDTA). In general this should be acceptable: the job only runs at 2am, so there is scope to test the new deployment on staging by force-running the task, with time to roll back if necessary before the nightly scheduled run.

  • A breaking change to the Paketo image builder rendered the new component images invalid, throwing the following error on startup:
    2021-09-29T02:40:40.620Z # JRE version: (8.0_302-b08) (build )
    2021-09-29T02:40:40.621Z # Java VM: OpenJDK 64-Bit Server VM (25.302-b08 mixed mode linux-amd64 compressed oops)
    2021-09-29T02:40:40.621Z # Problematic frame:
    2021-09-29T02:40:40.623Z # V [libjvm.so+0xb2a2f7] VirtualMemoryTracker::add_reserved_region(unsigned char*, unsigned long, NativeCallStack const&, MemoryType, bool)+0x77
    2021-09-29T02:40:40.623Z #
    2021-09-29T02:40:40.623Z # Core dump written. Default location: /workspace/core or core.1
    The fix is documented here: https://github.com/paketo-buildpacks/libjvm/pull/105

  • Two workarounds were explored: (a) temporarily pinning a specific (older) version of the builder, or (b) migrating the component to Java 11.
    (a) Amending the pom.xml file as follows was successful, but would need to be reverted once a new working version of the third-party builder was released:
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <configuration>
        <image>
          <builder>gcr.io/paketo-buildpacks/builder:0.1.174-base</builder>
        </image>
      </configuration>
    </plugin>
    (b) Migrating to Java 11, with concomitant version changes for Groovy etc., was also successful and was the preferred method. This was implemented, and a revised, working image of the component was deployed at 08:30 BST on 2021-09-30.
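The exact pom.xml changes made for the Java 11 migration are not recorded on this page. As a rough sketch only, a migration of this kind in a Spring Boot project typically touches the Java version property and, optionally, the JVM version requested from the Paketo buildpack. The fragment below assumes the project inherits from spring-boot-starter-parent (which reads the java.version property). The explicit BP_JVM_VERSION entry is also an assumption, included for illustration and not taken from the actual change.

```xml
<!-- Sketch only: not the actual TIS-NDW-ETL change. -->
<properties>
  <!-- Assumes spring-boot-starter-parent, which uses this property for the compiler level. -->
  <java.version>11</java.version>
</properties>

<build>
  <plugins>
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <configuration>
        <image>
          <env>
            <!-- Paketo buildpack setting that selects the JVM major version baked into the image. -->
            <BP_JVM_VERSION>11</BP_JVM_VERSION>
          </env>
        </image>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Unlike workaround (a), a change along these lines does not pin the builder image, so it would not need reverting once the upstream fix is released.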


Action Items

  • Rebuild tis-ndw-etl using Java 11 (Owner: Reuben Roberts; Done)

  • Possibly update the component documentation to flag up the deployment process, to reduce the risk of future assumptions causing this sort of issue (Owner: Reuben Roberts)


Lessons Learned

  • Assumptions as to the CI/CD pathway for a specific component can be problematic.

  • It would be useful to flag up non-standard deployment paths in component documentation.
