Date	21 Dec 2016
Authors	Grante Marshall, Graham O'Regan
Status	Complete
Summary	The Intrepid ETL process ran on production but failed due to an error in the SQL query. The Elasticsearch snapshot restore also failed, which left us unable to update the index for the day. We checked with Joanne Watson, the service manager, to see whether the failure would impact the pilot, but the service wasn't being used, so we didn't request access to the DR from Hicom to resolve the issue.
Impact	The service wasn't usable for the day.

Root Cause

A SQL query was referencing the test DR schema. Once we detected that the process had failed, we checked the configuration of the Docker containers and quickly realised that a Docker image had been updated on production, so pre-production code had been released.

Trigger

The nightly Intrepid ETL ran and failed.

Resolution

Pinned the versions of the containers in our configuration.
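
The resolution comes down to making sure every image reference in the configuration names an explicit version, because a reference with no tag (or with the `latest` tag) resolves to whatever image was pushed most recently. The following is a minimal sketch of a check for that, not the team's actual tooling. The function names, the shape of the `images` mapping, and the image names in the usage note are all hypothetical.

```python
def is_pinned(image_ref: str) -> bool:
    """Return True if the reference names a digest or an explicit non-'latest' tag."""
    # A digest (name@sha256:...) always identifies one exact image.
    if "@" in image_ref:
        return True
    # Only look for a tag after the last '/', so a registry port such as
    # 'registry.example.com:5000/etl' is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: Docker falls back to 'latest'
    tag = last_segment.split(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_images(images: dict[str, str]) -> list[str]:
    """Return the sorted names of services whose image reference is not pinned.

    `images` is a hypothetical mapping of service name to image reference,
    standing in for a single config file of container versions.
    """
    return sorted(name for name, ref in images.items() if not is_pinned(ref))
```

As a usage example, `unpinned_images({"intrepid-etl": "example/intrepid-etl:1.4.2", "search": "example/search"})` would flag only `search`. Running a check like this before a stage or production deploy is one possible way to catch an unpinned image before the nightly job picks it up.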

Detection

After the issues with Hicom's DR run on 20 Dec 2016, the team checked the service the following morning. Alex Dobre discovered the issue by looking at the container log files.

Action Items

Action Item	Type	Owner	Issue
Create single config file for container versions	prevent	Graham O'Regan	TISDEV-1445
Pin versions of containers in stage and prod	prevent	Graham O'Regan	TISDEV-1475

Timeline

The etl-prod job ran at 3am.
The team checked the service at 9am.
Graham O'Regan pinned the version of the Intrepid ETL container at 9pm, ahead of the next morning's run.

Supporting Information

etl-prod#54

2016-12-21 Intrepid ETL Failure