2017-09-22 Premature DNS change-over
Date | 2017-09-22 |
Authors | Graham O'Regan |
Status | Complete |
Summary | The production server switched over to the new platform earlier than the time we had agreed with the NHS Digital team. |
Impact | Any work that was completed on the new platform had to be repeated on the old platform after the switch was reverted. |
Root Cause
The DNS change request was misinterpreted: the switch was made on receipt of an email from Graham O'Regan at 09:48 instead of after 16:00.
Trigger
We requested that the DNS be swapped after 16:00 on Friday 22nd, on the understanding that it would take a further 24 hours for the TTL to expire and for the change to become visible. However, an email sent to clarify that the change would go ahead on Friday and would take time to propagate was misinterpreted as a request to make the change, and the change took effect almost straight away (the Slack log under Supporting Information shows the record with a TTL of 300 seconds).
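The gap between the planned and the actual propagation window can be illustrated with a little arithmetic. This is a minimal sketch, not a record of what NHS Digital did: `latest_visible_at` is a hypothetical helper, it assumes every resolver honours the TTL, and it uses 09:48 (the time of the email) as a stand-in for the actual change time, which the report does not record.

```python
from datetime import datetime, timedelta


def latest_visible_at(changed_at: datetime, ttl_seconds: int) -> datetime:
    """Worst-case time by which a resolver that cached the old record
    just before the change will have expired it (assumes TTLs are honoured)."""
    return changed_at + timedelta(seconds=ttl_seconds)


# Planned: change at 16:00 on Friday with a 24-hour TTL.
planned = latest_visible_at(datetime(2017, 9, 22, 16, 0), 24 * 60 * 60)

# Observed: TTL of 300 seconds (from the dig output in the Slack log);
# 09:48 is an assumed change time, taken from the time of the email.
actual = latest_visible_at(datetime(2017, 9, 22, 9, 48), 300)

print(planned)  # 2017-09-23 16:00:00
print(actual)   # 2017-09-22 09:53:00
```

With the lowered TTL, the worst-case delay shrinks from a day to five minutes, which is consistent with the change appearing to take effect immediately.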
Resolution
An NHS Digital engineer reverted the DNS change so that the records pointed back to the old platform.
Detection
TIS developers noticed that they were no longer pointing at the correct development platform, which prompted the team to check all the DNS changes.
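Detection here relied on a developer noticing the symptom by hand. A small check along the following lines could flag an unexpected record change sooner. It is a sketch under stated assumptions: `check_dns` is a hypothetical helper, the hostname and IP addresses are the ones quoted in the Slack log, and the system resolver's cached answer is assumed to be good enough for the comparison.

```python
import socket
from typing import Callable

# Values quoted in the Slack log; treat them as illustrative.
HOSTNAME = "dev-apps.tis.nhs.uk"
EXPECTED_DEV_IP = "52.166.148.74"  # old dev platform


def check_dns(
    hostname: str,
    expected_ip: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True if hostname resolves to expected_ip, otherwise print a warning.

    The resolver is injectable so the check can be exercised without network access.
    """
    actual_ip = resolve(hostname)
    if actual_ip != expected_ip:
        print(f"WARNING: {hostname} resolves to {actual_ip}, expected {expected_ip}")
        return False
    return True
```

Calling `check_dns(HOSTNAME, EXPECTED_DEV_IP)` on a schedule, for example from a CI job, would have reported the swap as soon as the local resolver picked up the new record.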
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Reschedule DNS change with 24hr TTL to 16:00 | prevent | Chris Mills | |
Timeline
11:25 Panos Paralakis reported that he couldn't see the dev environment.
11:41 Fayaz Abdul noticed that the DNS records had changed.
12:35 Graham O'Regan confirmed that the production app server was also affected.
12:45 Graham O'Regan stopped the web server on the new platform to prevent further updates.
13:09 Chris Mills requested that the changes be reverted.
13:17 NHS Digital confirmed that the revert was complete.
Supporting Information
panos [11:25 AM] I can not access dev at all
[11:25] Internal server error
[11:25] Anyway can someone tell me about Training numbers search param please?
[11:26] is it query or searchQuery ?
[11:26] In dev ui it does not exist at all
[11:28] tcs/api/training-numbers?page=0&size=200&sort=asc&query=1
[11:28] or tcs/api/training-numbers?page=0&size=200&sort=asc&searchQuery=1 ?
[11:29] @here :point_up_2:
alex.dobre [11:36 AM] @panos looking at dev with fayaz - will try to get the training numbers swagger up on UI DEV as well
panos [11:36 AM] KK thanks!
fayaz [11:41 AM] it seems dns records change took into effect
[11:42] ;; ANSWER SECTION: dev-apps.tis.nhs.uk. 300 IN A 52.174.60.195
[11:42] thats the ip of our new dev
[11:42] old devs ip is 52.166.148.74
[11:43] @chrism - but the change was scheduled after 4:30pm right?
chrism [11:43 AM] Yeah that's what he said.
[11:47] Looks like what they've done is lowered the TTL as asked. Then when Graham sent that email this morning asking for the first one. They've just gone ahead and done it
graham [12:35 PM] @fayaz @chrism so have we swapped?!
[12:37] @fayaz looks like they did, did you copy the data over last night?
chrism [12:43 PM] I've asked him to revert the DNS changes as the TTL is low now.
graham [12:44 PM] k, looks like it flushed almost straight away and not after the 24hr ttl expired :confused:
chrism [12:44 PM] We'd planned that we'd request them for 4:30 so had time to sort over weekend but he got confused it seems and did a bit of both requests.
[12:45] I'm not sure how they changed if they were `scheduled for 4:30` though.
graham [12:45 PM] i’ve stopped apache on the new prod server to prevent any further requests to it
emanuele [12:52 PM] @graham @chrism I may have missed some, however UI-DEV has issues too…
[12:53] 10.110.0.136
graham [12:53 PM] we’ll look at that later, need to get live back up and running, they are reverting the dns change but i’m stilling the new value
[12:54] @channel we’re going to ahve to copy the connection discrepancies db from the new server to the old one, i can see a bunch of audit events from andy petherbridge
[12:57] @fayaz can you merge master and terraform? trying to flip between branches is a nightmare right now
fayaz [12:58 PM] @graham - will do it, just want to be sure we don’t want to run any playbooks
graham [12:59 PM] also, can you add my key? i can’t see to be able to get to the app server
chrism [12:59 PM] I'll sort my conflicts out after fayaz. I realised the the platform didn't work how I thought it did.
graham [12:59 PM] @chrism guess your change for GMC isn’t live so andy couldn’t have made any changes to GMC connections?
[12:59] kk
chrism [1:00 PM] Nope it's not
graham [1:00 PM] ok, should be ok then, i’ll speak to him to get him to repeat that work when we’re finished
[1:02] @here so right now we’re waiting for the DNS to revert back to the old platform, nothing else to do until then
fayaz [1:04 PM] @graham: its already added, jump through new-jenkins
[1:04] https://hee-tis.atlassian.net/wiki/spaces/TISDEV/pages/95748197/New+VNETs
graham [1:04 PM] hmm, heetis@HEE-TIS-VM-JENKINS:~$ ssh heetis@10.170.0.132 Permission denied (publickey).
fayaz [1:05 PM] looking at it
fayaz [1:14 PM] @here: please hold of your commits or merges to TIS-DEVOPS
panos [1:22 PM] @fayaz Can I push to dev ?
fayaz [1:22 PM] yes @panos
panos [1:22 PM] kk thanks!
fayaz [1:22 PM] @here: after 4pm today please don’t push anything to any environment
[1:22] I will be stopping the jenkins service at 4pm
[1:22] and prep for the migration
chrism [1:23 PM] It should be back on the old DNS now
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213