2019-08-12 GMC Sync and Intrepid-Reval-etl-all-prod failures

Date	12 Aug 2019
Authors	Ashley Ransoo, Alistair Pringle (Unlicensed), John Simmons (Deactivated)
Status	GMC Sync and Intrepid-Reval-ETL failures from 9th August to 13th August
Summary	~~??? Docker wasn't able to run the services as the containers didn't exist in the correct location~~ Both the GMC Sync and Intrepid Reval ETLs failed to run from the evening of Friday 9th August until 13th August. The Jenkins jobs themselves reported no errors, but on inspection the logs of both ETLs showed the error "no hosts matched".
Impact	~~No current information in Reval~~ Reval data in the TIS Reval application was out of sync for several days, so doctors due for revalidation were missing from it.

Impact

No new data in the Reval App

Root Causes

~~GMC-SYNC and Interpid-dr-etl docker containers were not synced to new ACR location~~
GMC Sync and Intrepid Reval ETLs did not run successfully

Trigger

Joanne Watson (Unlicensed) and Alistair Pringle (Unlicensed) raised this as a potential #fire_fire issue on the Slack dev channel

Resolution

~~Copied the missing docker images/manifests from old repo to new repo and reran the jenkins jobs~~
John Simmons (Deactivated) - need your input here.....

Detection / Timeline

2019-08-12 11:12 User reported on MS Teams on Monday that '..issue with the API between TIS and GMC as we have 9 under notice on TIS for 30.11.2019 but far more than this on GMC connect and for Dec 2019 1 on TIS but 11 on GMC connect'
2019-08-12 11:50 Jay volunteered to look into this and asked for support from the team. Pepe and Jay started an investigation. It is not clear what was looked at at this point; there was a lot of other discussion about whether this was a fire_fire issue and whether Reval itself is a priority.
2019-08-13 12:07 By this time all the chit-chat, Question Time and other political discussions had finished without a clear conclusion. About an hour in total since the issue was raised.
2019-08-13 16:45 BA (Ashley) started investigating in order to raise a ticket in our backlog
2019-08-13 17:00 Ashley found the error message '...no hosts matched...' in both the GMC SYNC and INTREPID REVAL ETLs and raised this with the PO, who concurred it was a fire_fire issue.
2019-08-13 17:14 Ashley replied to the thread on the Slack dev channel, alerting the ops team that this was a potential infrastructure issue rather than a bug in the application code needing a dev to fix.
2019-08-13 17:23 John kindly stepped in and started looking into a fix. (John Simmons (Deactivated) please add details of the fix here/to the ticket)
2019-08-13 17:35 Fix applied and GMC SYNC re-run which completed successfully
2019-08-13 17:42 REVAL ETL re-run and completed successfully
2019-08-13 17:52 POs alerted by John to the fix.
Need to check with the users on the morning of 14/08 whether this has resolved the reported issue.

Lessons Learned

~~Should have taken ETL's into consideration of docker images that needed moving to new repo~~
The impact of the ETLs not running should have been considered, and resolving it should have been treated as a fire_fire issue
The Slack monitoring of the GMC SYNC ETL and INTREPID REVAL ETL needs improving, including adding more information about the Ansible failure to the Slack notification. - John Simmons (Deactivated), I think this might need a ticket.
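
As a starting point for that ticket, here is a minimal Python sketch of how a job's console log could be scanned for Ansible failure lines and turned into a Slack message. It is only an illustration under assumptions, not how the current jobs work: the pattern list, the function names, the job name and the sample log are all hypothetical. The only parts taken from this incident are the "no hosts matched" text and the fact that the Jenkins jobs themselves reported no errors.

```python
import json
import re

# Patterns treated as signs of an Ansible failure. Illustrative only, not
# exhaustive; "no hosts matched" is the message seen in this incident.
FAILURE_PATTERNS = [
    re.compile(r"no hosts matched", re.IGNORECASE),
    re.compile(r"^fatal:", re.IGNORECASE),
    re.compile(r"UNREACHABLE!"),
]


def find_ansible_failures(log_text):
    """Return the log lines that match any failure pattern, in log order."""
    hits = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if any(pattern.search(stripped) for pattern in FAILURE_PATTERNS):
            hits.append(stripped)
    return hits


def build_slack_payload(job_name, log_text, max_lines=5):
    """Build a {"text": ...} payload for a Slack notification.

    Returns None when no failure pattern is found, so the caller can keep
    sending its normal success message.
    """
    hits = find_ansible_failures(log_text)
    if not hits:
        return None
    shown = hits[:max_lines]  # keep the notification short
    text = (
        f"{job_name}: {len(hits)} suspicious Ansible log line(s)\n"
        + "\n".join(shown)
    )
    return {"text": text}


if __name__ == "__main__":
    # Hypothetical log excerpt, not copied from the real job output.
    sample_log = "PLAY [run etl] ***\nskipping: no hosts matched\nPLAY RECAP ***\n"
    payload = build_slack_payload("gmc-sync-prod", sample_log)
    print(json.dumps(payload, indent=2))
```

Scanning the log text rather than relying on the job status is deliberate: in this incident the jobs looked successful while the log contained the error, so a notification based on job status alone would have stayed green.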

What went well

Fast fix once the error was pointed out, and the ETLs were re-run so that the correct data was in prod before the start of the working day.

What went wrong

The ETL failures should not have gone unnoticed for 3 days, nor been left uninvestigated given the impact.