Date | |
Authors | Grante Marshall, Graham O'Regan |
Status | Complete |
Summary | The Intrepid ETL process ran on production but failed due to an error in the SQL query. The Elasticsearch snapshot restore also failed, which left us unable to update the index for the day. We checked with Joanne Watson, the service manager, to see whether it would impact the pilot. It wasn't being used, so we didn't request access to the DR from Hicom to resolve the issue. |
Impact | The service wasn't usable for the day. |
Root Cause
A SQL query was referencing the test DR schema. Once we detected that the process had failed, we checked the configuration of the Docker containers and quickly realised that a Docker image had been updated on production, so pre-production code had been released.
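A pre-release check along the following lines could flag queries that still point at a test schema. This is only a minimal sketch: the schema name `test_dr` and the function `forbidden_schemas_in` are hypothetical, since the real schema names are not recorded here, and the regular expression only handles unquoted `schema.object` qualifiers.

```python
import re

# Hypothetical schema name; the real test DR schema name is not given in this report.
FORBIDDEN_SCHEMAS = {"test_dr"}


def forbidden_schemas_in(sql, forbidden=FORBIDDEN_SCHEMAS):
    """Return the forbidden schema names used as `schema.object` qualifiers in sql."""
    # Capture every unquoted identifier that is immediately followed by a dot.
    qualifiers = re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\.", sql)
    return {q.lower() for q in qualifiers} & {s.lower() for s in forbidden}
```

Run against each query before an image is promoted, an empty result means no forbidden schema was found. Table aliases such as `p.id` are also captured as qualifiers, but they only matter if they collide with a forbidden name.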
Trigger
The nightly Intrepid ETL ran and failed.
Resolution
Pinned the versions of the containers in our configuration.
Detection
After the issues with Hicom's DR run, the team checked the service the following morning. Alex Dobre discovered the issue by looking at the container log files.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Create single config file for container versions | prevent | Graham O'Regan | |
Pin versions of containers in stage and prod | prevent | Graham O'Regan | |
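The two action items above could be combined into a small check, sketched here under stated assumptions. The `CONTAINER_VERSIONS` mapping, the image names, and the registry host are hypothetical placeholders, not our actual configuration. The idea is that a single mapping holds every image reference, and the check rejects any reference that has no explicit tag or that uses `latest`.

```python
def is_pinned(image):
    """True if the image reference carries a digest or an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    # Only the part after the last '/' can carry a tag; an earlier colon is a registry port.
    last_segment = image.rsplit("/", 1)[-1]
    _name, sep, tag = last_segment.partition(":")
    return bool(sep) and tag not in ("", "latest")


def unpinned_images(versions):
    """Return the sorted service names whose image reference is not pinned."""
    return sorted(name for name, image in versions.items() if not is_pinned(image))


# Hypothetical single source of truth for container versions (names are illustrative only).
CONTAINER_VERSIONS = {
    "intrepid-etl": "registry.example.com:5000/intrepid-etl:1.4.2",
    "es-restore": "registry.example.com:5000/es-restore",
}
```

With the placeholder data, `unpinned_images(CONTAINER_VERSIONS)` returns `["es-restore"]`. A deployment step for stage and prod could refuse to run whenever that list is non-empty.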
Timeline
- The etl-prod job ran at 3am
- The team checked the service at 9am
- The version of the container for the Intrepid ETL was pinned by Graham O'Regan at 9pm, ahead of the next morning's run.