Date

 

Authors: Grante Marshall, Graham O'Regan
Status: Complete
Summary: The Intrepid ETL process ran on production but failed due to an error in the SQL query. The Elasticsearch snapshot restore also failed, which left us unable to update the index for the day. We checked with Joanne Watson, the service manager, to see whether it would impact the pilot, but the service wasn't being used, so we didn't request access to the DR from Hicom to resolve the issue.
Impact: The service wasn't usable for the day.

Root Cause

A SQL query referenced the test DR schema. Once we detected that the process had failed, we checked the configuration of the Docker containers and quickly realised that a Docker image had been updated on production, so pre-production code had been released.

Trigger

The nightly Intrepid ETL ran and failed. 

Resolution

Pinned the versions of the containers in our configuration.
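As an illustration of the resolution above, the following is a minimal sketch of a check that flags container image references that are not pinned to a fixed version. It is not the team's actual tooling: the function names and the example image references are hypothetical, and it assumes that "pinned" means either an explicit tag other than `latest` or a `sha256` digest.

```python
from typing import Dict, List, Optional


def image_tag(ref: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    name = ref.split("@", 1)[0]      # drop any digest suffix
    last = name.rsplit("/", 1)[-1]   # last path component, so a registry port is not mistaken for a tag
    if ":" in last:
        return last.split(":", 1)[1]
    return None


def is_pinned(ref: str) -> bool:
    """Treat a reference as pinned if it has a digest or a tag other than 'latest'."""
    if "@sha256:" in ref:
        return True
    tag = image_tag(ref)
    return tag is not None and tag != "latest"


def unpinned(images: Dict[str, str]) -> List[str]:
    """Return the names of services whose image reference is not pinned."""
    return sorted(name for name, ref in images.items() if not is_pinned(ref))


# Hypothetical service-to-image mapping, for illustration only.
EXAMPLE_IMAGES = {
    "intrepid-etl": "example.org/intrepid-etl:latest",
    "search": "example.org/search:1.0.0",
    "reporting": "localhost:5000/reporting",
}
```

A check like this could run before a stage or production deployment and fail the release if `unpinned(...)` returns anything.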

Detection

After the issues with Hicom's DR run, the team checked the service the following morning. Alex Dobre discovered the issue by looking at the container log files.

Action Items

Action Item | Type | Owner | Issue
Create single config file for container versions | prevent | Graham O'Regan | TISDEV-1445
Pin versions of containers in stage and prod | prevent | Graham O'Regan | TISDEV-1475
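The first action item, a single config file for container versions, could look roughly like the sketch below. This is an assumption about one possible shape, not the configuration that was actually built: the JSON layout, the image names, and the version numbers are all hypothetical. The idea is that every environment derives its image references from one file, and that a missing or floating version is rejected at load time.

```python
import json
from typing import Dict

# Hypothetical contents of a single versions file shared by stage and prod.
VERSIONS_JSON = """
{
  "intrepid-etl": {"image": "example.org/intrepid-etl", "version": "1.0.0"},
  "search": {"image": "example.org/search", "version": "2.0.0"}
}
"""


def load_image_refs(text: str) -> Dict[str, str]:
    """Build 'image:version' references, rejecting missing or floating versions."""
    entries = json.loads(text)
    refs = {}
    for service, entry in entries.items():
        version = entry.get("version", "")
        if not version or version == "latest":
            raise ValueError(f"{service}: version must be an explicit, fixed value")
        refs[service] = f"{entry['image']}:{version}"
    return refs


IMAGE_REFS = load_image_refs(VERSIONS_JSON)
```

Keeping the versions in one place means an upgrade is a reviewed change to a single file rather than a side effect of an image being republished.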

Timeline

  • The etl-prod job ran at 3am
  • The team checked the service at 9am
  • The version of the container for the Intrepid ETL was pinned by Graham O'Regan at 9pm ahead of the next morning's run.

Supporting Information

