2021-07-21 GMC API Failure

Date

Jul 21, 2021

Authors

@Adewale Adekoya @Jayanta Saha @John Simmons

Status

Done

Summary

 

Impact

Reval users could not manage trainees using the current Reval app.

Non-technical Description

GMC was inadvertently blocking our connection to their production environment.

What was the issue?

Our nightly synchronisation ETL was blocked by the GMC firewall when it tried to connect to the GMC database. After double-checking our own infrastructure to rule out a fault on our side, we contacted the GMC to highlight the issue. They confirmed the oversight: on 20/07/2021 they had implemented additional security in response to malicious activity hitting their service, and that update was overzealous and blocked our connection to the API. On 21/07/2021 at 11am, GMC added the HEE IP addresses to a ‘whitelist’, which restored our connection while still keeping the unwanted traffic out.


Trigger

Additional security was added to the Cloudflare firewall in front of the GMC service after the GMC was attacked the previous day. This blocked HEE's access to the GMC API.
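
One way to tell an upstream firewall block apart from a fault on our side is to inspect the failed response. The following is a minimal sketch only, assuming the sync reaches the GMC API over HTTP and that a blocked request comes back with a 403 or 429 status carrying Cloudflare's `Server: cloudflare` or `CF-RAY` response headers. The function name and the choice of status codes are illustrative and are not taken from the Reval codebase.

```python
def looks_like_upstream_block(status_code, headers):
    """Return True if a failed response looks like a block by a Cloudflare-fronted upstream."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    normalised = {name.lower(): value.lower() for name, value in headers.items()}
    served_by_cloudflare = (
        normalised.get("server") == "cloudflare" or "cf-ray" in normalised
    )
    # Assumption: a firewall block surfaces as 403 (forbidden) or 429 (rate limited).
    return status_code in (403, 429) and served_by_cloudflare
```

A check like this could be logged alongside the sync failure so that the alert in the monitoring channel already hints at where to look first.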


Detection

Alert in monitoring channel:

 


Resolution

Comms with GMC who stated:

We then re-ran the GMC sync and the associated ETLs, and all completed successfully.

 


Timeline

Jul 21, 2021: 09:40 - Joseph Kelly noticed issue in monitoring channel

Jul 21, 2021: 10:26 - Katy raised issue on Teams

Jul 21, 2021: 10:26 - John raised issue on Slack

Jul 21, 2021: 11:33 - Ade raised issue with GMC

Jul 21, 2021: 11:55 - GMC requested more details

Jul 21, 2021: 12:35 - Ade supplied more details

Jul 21, 2021: 14:16 - GMC emailed to confirm the issue was resolved

Root Cause(s)

A firewall update at GMC was too restrictive and blocked many connections, including ours.

Lessons Learned

The problem was entirely at the GMC end, and there was nothing we could have put in place to mitigate it.

On communication, however: as we had noticed the issue via our own monitoring, we arguably should have alerted Reval Admins in Teams before they raised the problem themselves, and then posted regular updates on progress and resolution. (For this incident the turnaround was very quick in any case.)
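
A proactive update to Reval Admins could follow a fixed message shape so that it is quick to post. This is a rough sketch under stated assumptions: the function name, the field layout and the `{"text": ...}` payload shape are hypothetical, and the code only builds the message rather than sending it to Teams.

```python
from datetime import datetime


def build_incident_update(service, status, detected_at, detail):
    """Build a simple chat payload describing the current incident state (hypothetical format)."""
    # Put the status first so readers can see at a glance whether the issue is ongoing.
    text = f"[{status.upper()}] {service} (detected {detected_at:%d/%m/%Y %H:%M}): {detail}"
    return {"text": text}
```

For example, `build_incident_update("GMC sync", "investigating", datetime(2021, 7, 21, 9, 40), "Nightly sync is failing.")` produces a single line of text that can be reposted with an updated status as the incident progresses.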