2021-07-21 GMC API Failure
Date | Jul 21, 2021 |
Authors | @Adewale Adekoya @Jayanta Saha @John Simmons (Deactivated) |
Status | Done |
Summary |
|
Impact | Reval users could not manage trainees using the current Reval App |
Non-technical Description
GMC was inadvertently blocking our connection to their production environment.
What was the issue?
Our nightly synchronisation ETL was blocked by the GMC's firewall when it tried to connect to the GMC database. After double-checking our own infrastructure to rule out a cause on our side, we contacted the GMC to highlight the issue. They confirmed the oversight: on 20/07/2021 they had implemented additional security in response to malicious activity hitting their service. That security update was overzealous and blocked our connection to the API. On 21/07/2021 at 11am, the GMC added the HEE IP addresses to a ‘whitelist’, which restored our connection while still keeping the unwanted traffic out.
Trigger
The GMC had added additional security to their Cloudflare firewall after being attacked the previous day. This blocked HEE's access to the GMC API.
Detection
Alert in monitoring channel:
Resolution
Comms with GMC who stated:
We then re-ran the GMC sync and the associated ETLs, and all ran successfully.
Timeline
Jul 21, 2021: 09:40 - Joseph Kelly noticed issue in monitoring channel
Jul 21, 2021: 10:26 - Katy raised issue on Teams
Jul 21, 2021: 10:26 - John raised issue on Slack
Jul 21, 2021: 11:33 - Ade raised issue with GMC
Jul 21, 2021: 11:55 - GMC requested more details
Jul 21, 2021: 12:35 - Ade supplied more details
Jul 21, 2021: 14:16 - GMC emailed issue resolved
Root Cause(s)
A firewall update at the GMC was too restrictive and blocked many connections, including ours.
Lessons Learned
No technical lessons learned, as the problem was entirely at the GMC end and there isn't anything we could have put in place to mitigate it.
As we had noticed the issue via our monitoring, we arguably should have alerted Reval Admins in Teams before they highlighted the problem, and then posted regular updates on progress and resolution (although for this incident the turnaround was very quick anyway).
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213