...
Date | |
Authors | John Simmons (Deactivated), Jayanta Saha, Ashley Ransoo |
Status | ETL stabilised |
Summary | 1 of 5 processes errored. |
Impact | Need to manually copy files generated by our ESR ETL and move them to their `FTP In` folder in Azure |
---|
Impact
- One of the 5 processes forming the ESR ETL was not completing properly, so some data was not syncing to the `FTP In` folder in Azure.
Root Causes
- The version number of TCS changed for the first time in many months. One of the 5 ESR ETL processes looks for a specific version of TCS and couldn't find it.
Trigger
- The new monitoring and alerting system for the ESR ETL picked up the failures and sent alerts to the #esr-operations Slack channel.
Resolution
- On the Friday, the root cause was found and we altered the TCS version reference in the ESR ETL so that it started working again. On the following Monday, Ash identified the affected files and manually placed them in the folder. They were picked up in the next ESR ETL run and normal service was restored.
Detection / Timeline
- Tue 19 Feb: 15.30 - Alert in #esr_operations channel in Slack. Assumption was that network problems were rendering the ESR service unavailable.
- Wed 20 Feb: 15.30 - 2nd alert. Fuller investigation started, but no obvious reason for the problem was found.
- Thu 21 Feb: 15.30 - 3rd alert. Ash indicated a change in the TCS service might be the issue. John narrowed it down to some work on Dr Vacant posts. Paul was called in to pinpoint it. Ash raised the need to retrospectively process the files that had been missed since the 19th.
- Fri 22 Feb: am - Paul pinpointed the error and fixed the ESR ETL.
- Fri 22 Feb: 15.50 - Alert to the channel confirmed the ETL was processing correctly again.
- Wed 27 Feb: Sprint Planning - ticket TISNEW-2716 "ESR - Processing RMC files backlog due to ETL failure from 19/02 to 21/02" added to Sprint 69 (2019-02-27 - 2019-03-13).
Action Items
- Process RMC files.
- Investigate making the TCS version reference in the ESR ETL dynamic (look for the 'latest' version of TCS rather than a specific version).
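The second action item could be approached along these lines. This is a minimal sketch only: how the ESR ETL actually discovers TCS versions is not described in this report, so the list of available version strings, the dotted numeric version format, and the function names `parse_version` and `latest_version` are all assumptions made for illustration.

```python
import re


def parse_version(text):
    """Turn a dotted version such as '1.10.0' into (1, 10, 0).

    Comparing tuples of integers avoids the string-comparison trap
    where '1.9.0' would sort above '1.10.0'.
    """
    return tuple(int(part) for part in text.split("."))


def latest_version(available):
    """Return the highest version string from `available`.

    `available` is a hypothetical list of version strings; entries that
    are not purely dotted numbers are ignored. Raises LookupError when
    nothing usable is found, so the failure is explicit.
    """
    valid = [v for v in available if re.fullmatch(r"\d+(\.\d+)*", v)]
    if not valid:
        raise LookupError("no usable TCS version found")
    return max(valid, key=parse_version)
```

Raising an explicit error when no version is found means a future version change would show up as a named failure rather than a silent gap in the synced files.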
Lessons Learned
- The current implementation of the esr-ftp-sync is horrible; a better solution needs to be found. Further talks with IT regarding the underlying networking will be needed.
What went well
- Ashley's knowledge of the whole ESR system was invaluable in working out what was happening and when. Combined with Jay's knowledge of ESR root cause analysis and John's increasing knowledge from an ops perspective, this enabled the problem to be isolated and corrected.
What went wrong
- Not enough specific logging / monitoring of the ESR ETL to enable rapid identification of the specific problem.
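One way to narrow down which process failed, and why, is to log each process by name and record the exception when it fails. The sketch below is illustrative only: the logger name `esr_etl`, the function `run_processes`, and the idea that the 5 processes can be represented as named callables are assumptions, not a description of how the ESR ETL is built.

```python
import logging

logger = logging.getLogger("esr_etl")  # hypothetical logger name


def run_processes(processes):
    """Run each (name, callable) pair and return the names that failed.

    A failure in one process is logged with its traceback and does not
    stop the remaining processes from running.
    """
    failed = []
    for name, process in processes:
        try:
            process()
            logger.info("process %s completed", name)
        except Exception:
            # logger.exception records the traceback next to the process name
            logger.exception("process %s failed", name)
            failed.append(name)
    return failed
```

The returned list of failed names could be included in the alert message, so the alert identifies the failing process rather than only reporting that the ETL as a whole had a problem.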