2019-08-12 GMC Sync and Interpid-Reval-etl-all-prod failures
Date | |
Authors | |
Status | GMC Sync and Intrepid Reval ETL failures from 9th August to 13th August |
Summary |
Both the GMC sync and Intrepid reval ETLs failed to run from the evening of Friday 9th August to 13th August. The Jenkins jobs themselves reported no errors, but on inspection the logs of both ETLs showed the error "no hosts matched". |
Impact |
Reval data on the TIS Reval Application went out of sync, so doctors due for revalidation were missing from it for several days. |
Impact
No new data in the Reval App
Root Causes
- The GMC-SYNC and Interpid-dr-etl docker containers were not synced to the new ACR location
- The GMC SYNC and Intrepid Reval ETLs did not run successfully
Trigger
- Joanne Watson and Alistair Pringle raised this as a potential #fire_fire issue on the Slack dev channel
Resolution
- Copied the missing docker images/manifests from the old repo to the new repo and reran the Jenkins jobs
- John Simmons - need your input here.....
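A check along the following lines could confirm that nothing else was left behind in the old repo. This is only a sketch: the repository names and tags are hypothetical, and it assumes the image lists for both registries have already been exported by whatever tooling the team uses.

```python
def missing_from_new(old_registry, new_registry):
    """Map each repository to the tags found in the old registry but not the new one.

    Both arguments are dicts of repository name -> iterable of tags.
    """
    missing = {}
    for repo, old_tags in old_registry.items():
        absent = set(old_tags) - set(new_registry.get(repo, ()))
        if absent:
            missing[repo] = sorted(absent)
    return missing


if __name__ == "__main__":
    # Hypothetical listings, for illustration only.
    old = {"gmc-sync": ["latest"], "reval-etl": ["latest"], "other-app": ["1.0"]}
    new = {"other-app": ["1.0"]}
    print(missing_from_new(old, new))
```

An empty result would mean every image and tag in the old repo also exists in the new one.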
Detection / Timeline
- 2019-08-12 11:12 User reported on MS Teams on Monday that '..issue with the API between TIS and GMC as we have 9 under notice on TIS for 30.11.2019 but far more than this on GMC connect and for Dec 2019 1 on TIS but 11 on GMC connect'
- 2019-08-12 11:50 Jay volunteered to look into this and asked for support from the team. Pepe and Jay started an investigation. It is not clear what was looked at at this point, and there were lots of other discussions about whether this was a fire_fire issue and whether reval in itself is a priority.
- 2019-08-12 12:07 By this time all the chit-chat, Question Time and other political discussions had finished without a clear conclusion, about an hour in total after the issue was raised.
- 2019-08-13 16:45 BA (Ashley) investigation started in order to raise a ticket into our backlog
- 2019-08-13 17:00 Ashley found an error message on both the GMC SYNC and INTREPID REVAL ETLs - '...no hosts matched...' and raised this with the PO who concurred it's a fire_fire issue.
- 2019-08-13 17:14 Ashley posted a reply to the thread on the Slack dev channel alerting the ops team that this was a potential infrastructure issue and not a bug in the application code that would require a dev to fix.
- 2019-08-13 17:23 John kindly stepped in and started looking into a fix. (John Simmons, please add details of the fix here/to the ticket)
- 2019-08-13 17:35 Fix applied and GMC SYNC re-run which completed successfully
- 2019-08-13 17:42 REVAL ETL re-run and completed successfully
- 2019-08-13 17:52 POs alerted by John to the fix.
- Need to check in the morning of 14/08 with the users if this has resolved the reported issue.
Lessons Learned
- The ETLs should have been included among the docker images that needed moving to the new repo
- The impact of the ETLs not running should have been considered, and the problem treated as a fire_fire issue
- The Slack monitoring of the GMC SYNC ETL and INTREPID REVAL ETL needs improving, including adding more information about the Ansible failure to the Slack notification.
- TISNEW-3264
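The monitoring improvement could look something like the sketch below, which scans a job log for the "no hosts matched" text seen in this incident and builds a Slack message that carries the offending log lines. It is a minimal illustration under assumptions: the function names and job name are hypothetical, the sample log is made up, and actually posting the payload to a Slack webhook is left out.

```python
# Marker text quoted in this report; further markers would be assumptions.
FAILURE_MARKERS = ("no hosts matched",)


def find_failure_lines(log_text, markers=FAILURE_MARKERS):
    """Return the log lines containing any failure marker (case-insensitive)."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append(line.strip())
    return hits


def build_slack_payload(job_name, log_text):
    """Return (failed, payload) where payload is a {"text": ...} message body."""
    hits = find_failure_lines(log_text)
    if hits:
        detail = "\n".join(hits)
        text = f"{job_name} did not run. Matching log lines:\n{detail}"
    else:
        text = f"{job_name}: no known failure markers found in the log"
    return bool(hits), {"text": text}


if __name__ == "__main__":
    # Made-up log excerpt, for illustration only.
    sample_log = "PLAY [etl]\nskipping: no hosts matched\n"
    failed, payload = build_slack_payload("gmc-sync", sample_log)
    print(failed, payload["text"])
```

Checking the log text rather than only the job's exit status matters here because, as noted in the summary, the Jenkins jobs themselves reported no errors.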
What went well
- Once I understood the reval process (documented in Revalidation - Application Architecture Review), it became apparent that the issue was likely to be with the GMC SYNC and/or REVAL ETLs
- Fast fix by John once the error was pointed out, followed by a re-run of the ETLs, meaning that the correct data was in prod before the start of the next working day.
What went wrong
- The ETLs not running for 3 days should not have been missed in the first place, nor left uninvestigated given the impact.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213