Summary

Date |
Authors |
Status | Resolved
Summary |
Impact |

Table of Contents
Non-technical summary
Reval was showing information that had not been updated from GMC. On further investigation, it turned out that the overnight jobs that fetch the data from GMC and load it into TIS had not run successfully. Once they had been fixed and rerun, Reval showed the correct information.
Timeline
09:48 AM | |
10:25 AM | Created ticket and incident page https://hee-tis.atlassian.net/browse/TISNEW-5728 |
Between 10:25 and 11:21 | Ran the jobs. This fixed the refresh of data, but a difference remained between the under notice values in TIS legacy/existing Reval and GMC Connect |
12:06 | Ran the correct ETLs (gmc-sync-prod and intrepid-reval-etl-all-prod) |
12:25 | Problem is assumed to have been fixed |
Root Causes
An accidental major version update to one of our core infrastructure tools caused a failure in a dependent tool. The containers that were already running were unaffected, but no new containers could be launched, i.e. ETLs or newly deployed software versions.
Trigger
Accidentally upgrading, rather than updating, the servers that the TIS infrastructure runs on.
Resolution
1. Remove the server to be worked on from the load balancer so that all inbound TIS traffic is diverted to the working server.
2. Stop all running Docker containers.
3. Restart the server so that all of the upgrades/updates apply correctly.
4. Remove the old containers.
5. Remove the Docker networks associated with each of those containers.
6. Apply network fixes to move the network range to a non-overlapping range not in use by the new AWS infrastructure.
7. Restart all of the containers in sequence. This had to be done in series because, while the process was running (approximately 1 hour per server), the other server in our infrastructure was responsible for keeping TIS up for everyone else.
8. Check all services are up and working.
9. Allow the load balancer to send traffic to the fixed server.
10. Rinse and repeat with the second server (a rough sketch of this per-server loop follows the list).
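The recovery loop above could be scripted per server. Below is a minimal sketch, assuming the servers sit behind an AWS target group and run Docker Engine; the target group ARN, instance ID and the out-of-band reboot/upgrade step are placeholders, not details taken from this incident.

```python
"""Hypothetical per-server recovery loop: drain, stop, clean up, re-register."""
import boto3   # AWS SDK, used here only for target (de)registration
import docker  # Docker SDK for Python

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:..."  # placeholder
INSTANCE_ID = "i-0123456789abcdef0"                    # placeholder

elb = boto3.client("elbv2")
dkr = docker.from_env()

# 1. Drain this server so the other node takes all inbound TIS traffic.
elb.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN,
                       Targets=[{"Id": INSTANCE_ID}])

# 2. Stop every running container before the host is rebooted.
for container in dkr.containers.list():
    container.stop()

# (The reboot and package upgrade themselves happen out of band, e.g. over SSH.)

# 3. Remove the old containers and their now-unused networks so they can be
#    recreated on a non-overlapping address range.
dkr.containers.prune()
dkr.networks.prune()

# 4. Once the containers are back up and checked, return the server to service.
elb.register_targets(TargetGroupArn=TARGET_GROUP_ARN,
                     Targets=[{"Id": INSTANCE_ID}])
```

Draining the node first is what allowed the remaining server to carry all TIS traffic while the fixes were applied, which is why the servers had to be worked on in series.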
Detection
NDW ETL (Prod) failure alert on Slack
Reval / GMC sync ETLs failure alert on Slack
Actions
[insert actions to take to mitigate this happening in future]
e.g.
keep everything more up to date to avoid major impacts from upgrades in future
ensure one person is not a single point of failure: require code reviews for infrastructure changes
specific changes to the architecture (list them) to improve resilience:
Use of ‘serverless’ technology: ECS, RDS, DocumentDB
check that the STAGE upgrade matches PROD (a possible automation sketch follows this list)
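The STAGE/PROD parity check could be partly automated. This is a sketch only, assuming SSH access to the hosts; the host names are illustrative placeholders.

```python
"""Hypothetical parity check: confirm stage and prod report the same Docker
Engine version before an upgrade is signed off."""
import subprocess

HOSTS = {
    "stage": ["stage-blue.example.org", "stage-green.example.org"],  # placeholders
    "prod": ["prod-blue.example.org", "prod-green.example.org"],     # placeholders
}

def docker_version(host: str) -> str:
    """Return the Docker server version reported by a remote host over SSH."""
    out = subprocess.run(
        ["ssh", host, "docker", "version", "--format", "{{.Server.Version}}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

versions = {host: docker_version(host)
            for hosts in HOSTS.values() for host in hosts}
if len(set(versions.values())) > 1:
    raise SystemExit(f"Version drift between environments: {versions}")
print(f"All hosts agree on Docker {next(iter(versions.values()))}")
```

Run before and after an upgrade, a check like this would flag drift between environments before it reaches production.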
Lessons Learned (Good and Bad)
Good. The load-balanced infrastructure works, and we were able to keep TIS mostly available while we performed the fixes.
Bad. More care needs to be taken with the commands issued to the production servers. Repeatable playbooks, applied to non-production servers first, would reduce this risk.
Bad. Highlighted where we could do with more redundancy in the architecture. Load balancer health checks aren’t exhaustive; could they be extended to match Uptime Robot or similar monitoring? (A sketch of a deeper check follows.)
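As an illustration of what a deeper health check might look like, closer to what Uptime Robot polls, here is a sketch that probes several application endpoints rather than a single load balancer ping. The endpoint paths are placeholders, not the real TIS configuration.

```python
"""Hypothetical extended health check: every endpoint must return HTTP 200."""
import sys
import requests

ENDPOINTS = [
    "https://tis.example.org/api/status",       # placeholder paths
    "https://tis.example.org/reval/health",
    "https://tis.example.org/profile/health",
]

failures = []
for url in ENDPOINTS:
    try:
        resp = requests.get(url, timeout=5)
        if resp.status_code != 200:
            failures.append(f"{url} -> HTTP {resp.status_code}")
    except requests.RequestException as exc:
        failures.append(f"{url} -> {exc}")

if failures:
    print("\n".join(failures))
    sys.exit(1)  # non-zero exit lets a cron/monitoring wrapper raise an alert
print("all endpoints healthy")
```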
Root Causes
The Jenkins scheduled jobs had been amended to run on a different server when we had the prod outage on Friday 13th November 2020. This change had been overlooked and not rolled back. Therefore the jobs could not run afterwards as there was a conflict in the inventory (as shown in the Jenkins output for each job, and the fact that each job ran for less than 1 second).
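One way to catch an overlooked change like this would be to audit where each scheduled job is pinned to run. The sketch below assumes freestyle jobs whose config.xml exposes the restriction label as assignedNode; the Jenkins URL and credentials are placeholders.

```python
"""Hypothetical audit of the node/label each scheduled job is restricted to."""
import xml.etree.ElementTree as ET
import requests

JENKINS_URL = "https://jenkins.example.org"  # placeholder
AUTH = ("ci-user", "jenkins-api-token")      # placeholder API token

for job in ("gmc-sync-prod", "intrepid-reval-etl-all-prod"):
    config_xml = requests.get(
        f"{JENKINS_URL}/job/{job}/config.xml", auth=AUTH, timeout=10
    ).text
    # For freestyle jobs, "Restrict where this project can be run" is stored
    # in <assignedNode>; None means the job can run on any node.
    label = ET.fromstring(config_xml).findtext("assignedNode")
    print(f"{job}: restricted to {label!r}")
```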
Trigger
A user reported in the Teams support channel that their connections had not been working correctly.
Resolution
Running the correct ETL jobs, gmc-sync-prod and intrepid-reval-etl-all-prod, fixed the issue (a sketch of re-triggering these jobs remotely follows). More memory was also added to the Reval container.
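For reference, re-triggering the jobs can be done through Jenkins' remote build API rather than the UI. This is a sketch only; the Jenkins URL and API token are placeholders, and depending on the CSRF settings a crumb may also be required.

```python
"""Hypothetical remote trigger of the two jobs named above."""
import requests

JENKINS_URL = "https://jenkins.example.org"  # placeholder
AUTH = ("ci-user", "jenkins-api-token")      # placeholder API token
JOBS = ["gmc-sync-prod", "intrepid-reval-etl-all-prod"]

session = requests.Session()
session.auth = AUTH

for job in JOBS:
    # POST /job/<name>/build queues a new run of the job.
    resp = session.post(f"{JENKINS_URL}/job/{job}/build")
    resp.raise_for_status()
    # Jenkins returns the queue item URL in the Location header.
    print(f"queued {job}: {resp.headers.get('Location')}")
```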
Detection
A user report in the Teams support channel
Actions
For new Reval we need to add monitoring so we know whether the sync job that gets the data from GMC, and any ETL/transformation service we build, has run successfully or failed. We could use this ticket: https://hee-tis.atlassian.net/browse/TISNEW-3264 (a minimal alerting sketch follows).
A decision was taken not to address the monitoring in the current Reval application, as the new one is close to being live (December 2020).
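A minimal sketch of the kind of alerting the action above describes, assuming a Slack incoming webhook is available; the webhook URL and the run_gmc_sync() wrapper are placeholders, not part of the existing codebase.

```python
"""Hypothetical wrapper: report success or failure of a sync/ETL run to Slack."""
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify(text: str) -> None:
    """Post a message to the monitoring channel via a Slack incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

def run_gmc_sync() -> None:
    """Placeholder for the real GMC sync / ETL step."""
    raise NotImplementedError

try:
    run_gmc_sync()
except Exception as exc:  # report any failure, then re-raise for the scheduler
    notify(f":red_circle: GMC sync failed: {exc}")
    raise
else:
    notify(":large_green_circle: GMC sync completed successfully")
```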
Lessons Learned (Good and Bad)
Still limited knowledge within the existing teams about how the existing module works (which is why the rebuild is taking place)
Current monitoring requires more investment to get it working more reliably: there are problems with it being set to fail on the first occurrence, and alerting would need to be written into the individual apps rather than relying on checking logs.
Jobs need to be started via Jenkins.
Check the Jenkins jobs for 1. what they do, and 2. what the logs for that run said (a sketch of pulling this from the Jenkins API follows).
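Checking what a job last did can also be scripted against Jenkins' JSON API. The sketch below pulls the last build's result, duration and console log; the URL and credentials are placeholders.

```python
"""Hypothetical check of each job's last build: result, duration and log tail."""
import requests

JENKINS_URL = "https://jenkins.example.org"  # placeholder
AUTH = ("ci-user", "jenkins-api-token")      # placeholder API token

for job in ("gmc-sync-prod", "intrepid-reval-etl-all-prod"):
    build = requests.get(
        f"{JENKINS_URL}/job/{job}/lastBuild/api/json", auth=AUTH, timeout=10
    ).json()
    # A sub-second duration, as seen in this incident, is a strong hint that
    # the job aborted before doing any real work.
    print(f"{job}: result={build['result']} duration={build['duration']}ms")

    console = requests.get(
        f"{JENKINS_URL}/job/{job}/lastBuild/consoleText", auth=AUTH, timeout=10
    ).text
    print(console[-2000:])  # tail of the console log for a quick look
```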