Date |
|
Authors | |
Status | Resolved |
Summary | Ran the GMC SYNC ETL and INTREPID REVAL ETLs (intrepid-reval-etl and |
|
Impact |
---|
Non-technical summary
Reval was showing information that had not been updated from GMC. On further investigation, it turned out that the overnight jobs that get the data from GMC and then add it into TIS had not run successfully. Once the jobs had been fixed and rerun, Reval showed the correct information.
Timeline
- 09:48 AM | |
- 10:25 AM | Created ticket and incident page https://hee-tis.atlassian.net/browse/TISNEW-5728 |
- Between 10:25 and 11:21 | Ran the jobs. This fixed the refresh of data, but the under notice values still differed between TIS legacy/existing reval and GMC Connect |
- 12:06 | Ran the correct ETLs ( |
- 12:25 | Problem is assumed to have been fixed |
Root Causes
The Jenkins scheduled jobs had been amended to run on a different server during the prod outage on Friday 13th November 2020. This change was overlooked and not rolled back, so the jobs could not run afterwards because of a conflict in the inventory (as shown in the Jenkins output for each job, and by the fact that each job ran for less than 1 second).
Trigger
A user reported in the Teams Support Channel that their connections had not been working correctly.
Resolution
Running the correct ETLs, gmc-sync-prod and intrepid-reval-etl-all-prod, fixed the issue.
Added more memory to the Reval container.
Detection
A user reported the problem in the Teams Support Channel.
Actions
For new reval: we need to add monitoring so we know whether the sync job that gets the data from GMC, and any ETL/transformation service we build, runs successfully or fails. We could use this ticket: https://hee-tis.atlassian.net/browse/TISNEW-3264
Decision taken not to address the monitoring in the current reval application as the new one is pretty close to being live (December 2020)
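As a rough illustration of the monitoring action above, the sketch below flags a Jenkins build that did not succeed or that finished suspiciously quickly (in this incident each job ran for less than 1 second). It reads the last build record from the Jenkins JSON remote API. The base URL, the one-second threshold and the function names are assumptions made for illustration, and a real Jenkins instance would normally also need credentials on the request.

```python
import json
import urllib.request

# Assumed threshold: a real ETL run should take longer than one second.
MIN_DURATION_MS = 1000


def fetch_last_build(base_url, job_name):
    """Fetch the last build record of a job from the Jenkins JSON remote API.

    base_url is a placeholder for the Jenkins server address; authentication
    is omitted here and would need adding for a secured instance.
    """
    url = f"{base_url.rstrip('/')}/job/{job_name}/lastBuild/api/json"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def build_problems(build, min_duration_ms=MIN_DURATION_MS):
    """Return the reasons a finished build looks unhealthy (empty list if fine)."""
    if build.get("building"):
        return []  # still running, nothing to judge yet
    problems = []
    result = build.get("result")
    if result != "SUCCESS":
        problems.append(f"result was {result}")
    duration = build.get("duration", 0)  # Jenkins reports milliseconds
    if duration < min_duration_ms:
        problems.append(f"ran for only {duration} ms")
    return problems
```

A scheduled check could call `fetch_last_build` for gmc-sync-prod and intrepid-reval-etl-all-prod each morning and post any non-empty result of `build_problems` to the support channel.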
Lessons Learned (Good and Bad)
Still limited knowledge within the existing teams about how the existing module works (which is why the rebuild is taking place)
Need monitoring: we raised this on August 22, 2019 and still have not done it
Current monitoring requires more investment to make it work more reliably: there are problems with it being set to fail on the first occurrence, and alerting would need to be written into the individual apps rather than relying on checking of logs
Jobs need to be started via Jenkins
intrepid-reval-etl builds the ETL? Check the Jenkins jobs for 1. what they do, and 2. what the logs for that run said.