2017-09-22 Premature DNS change-over

Date: 2017-09-22
Authors: Graham O'Regan
Status: Complete
Summary: The production server switched over to the new platform earlier than we had planned with the NHS Digital team.
Impact: Any work completed on the new platform had to be repeated on the old platform after the switch was reverted.

Root Cause

The DNS change request was misinterpreted: the switch was made on receipt of an email from Graham O'Regan at 09:48 instead of after 16:00.

Trigger

We requested that the DNS be swapped after 16:00 on Friday 22nd, on the understanding that it would take a further 24 hours for the TTL to expire and for the change to be visible. However, an email sent to clarify that the change would go ahead on Friday and would take time to propagate was misinterpreted, and the change took effect straight away.
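The timing assumption behind the request can be sketched as simple arithmetic: a cached record can be served for up to one TTL after the change is made. The snippet below is only an illustration; the function name `latest_visible` is hypothetical, the 300-second TTL is the value shown in the `dig` output under Supporting Information, and treating 09:48 as the moment the change was applied is an assumption based on the Root Cause section.

```python
from datetime import datetime, timedelta


def latest_visible(change_applied: datetime, ttl_seconds: int) -> datetime:
    """Worst-case time by which resolvers that honour the TTL see the new record."""
    return change_applied + timedelta(seconds=ttl_seconds)


# What was planned: change after 16:00 on Friday with a 24-hour TTL.
planned = latest_visible(datetime(2017, 9, 22, 16, 0), 24 * 60 * 60)

# What appears to have happened: change applied around 09:48 (assumed)
# with the TTL already lowered to 300 seconds.
actual = latest_visible(datetime(2017, 9, 22, 9, 48), 300)

print(planned)  # 2017-09-23 16:00:00
print(actual)   # 2017-09-22 09:53:00
```

With the lower TTL already in place, the switch became visible within minutes rather than over the following day, which is consistent with what the team observed.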

Resolution

An NHS Digital engineer reverted the DNS change so that the records pointed back to the old platform.

Detection

TIS developers noticed that they were no longer pointing at the correct development platform, which prompted the team to check all the changes.
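Detection here was manual. A check along the following lines could flag an unexpected record change sooner. This is a rough sketch, not something the team ran: `record_status` and its `resolve` parameter are hypothetical names, and the hostname and IP addresses are taken from the chat log under Supporting Information. The example uses a stand-in resolver so it runs without network access; omitting `resolve` would use the standard library's `socket.gethostbyname` for a live lookup.

```python
import socket

# Addresses quoted in the chat log under Supporting Information.
OLD_DEV_IP = "52.166.148.74"
NEW_DEV_IP = "52.174.60.195"


def record_status(hostname, expected_ip, resolve=socket.gethostbyname):
    """Compare the address a hostname resolves to with the one we expect."""
    actual_ip = resolve(hostname)
    if actual_ip == expected_ip:
        return f"OK: {hostname} -> {actual_ip}"
    return f"ALERT: {hostname} -> {actual_ip}, expected {expected_ip}"


# Stand-in resolver reproducing the incident: the name already
# resolves to the new platform while the old one is still expected.
def incident_resolver(hostname):
    return NEW_DEV_IP


print(record_status("dev-apps.tis.nhs.uk", OLD_DEV_IP, resolve=incident_resolver))
# ALERT: dev-apps.tis.nhs.uk -> 52.174.60.195, expected 52.166.148.74
```

Because a local resolver may serve a cached answer, a check like this only reports what that resolver currently returns, not what the authoritative server holds.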

Action Items

Action Item | Type | Owner | Issue
Reschedule DNS change with 24hr TTL to 16:00 | prevent | Chris Mills |

Timeline

11:25 Panos Paralakis reported that he couldn't see the dev environment.

11:41 Fayaz Abdul noticed that the DNS records had changed.

12:35 Graham O'Regan confirmed that the production app server was also affected.

12:45 Graham O'Regan stopped the web server on the new platform to prevent further updates.

13:09 Chris Mills requested that the changes be reverted.

13:17 NHS Digital confirmed that the change had been reverted.

Supporting Information

panos [11:25 AM] 
I can not access dev at all


[11:25] 
Internal server error


[11:25] 
Anyway can someone tell me about Training numbers search param please?


[11:26] 
is it query or searchQuery ?


[11:26] 
In dev ui it does not exist at all


[11:28] 
tcs/api/training-numbers?page=0&size=200&sort=asc&query=1


[11:28] 
or tcs/api/training-numbers?page=0&size=200&sort=asc&searchQuery=1 ?


[11:29] 
@here :point_up_2:


alex.dobre [11:36 AM] 
@panos looking at dev with fayaz - will try to get the training numbers swagger up on UI DEV as well


panos
[11:36 AM] 
KK thanks!


fayaz [11:41 AM] 
it seems dns records change took into effect


[11:42] 
;; ANSWER SECTION:
dev-apps.tis.nhs.uk.    300    IN    A    52.174.60.195


[11:42] 
thats the ip of our new dev


[11:42] 
old devs ip is 52.166.148.74


[11:43] 
@chrism - but the change was scheduled after 4:30pm right?


chrism [11:43 AM] 
Yeah that's what he said.


[11:47] 
Looks like what they've done is lowered the TTL as asked. Then when Graham sent that email this morning asking for the first one. They've just gone ahead and done it


graham [12:35 PM] 
@fayaz @chrism so have we swapped?!


[12:37] 
@fayaz looks like they did, did you copy the data over last night?


chrism [12:43 PM] 
I've asked him to revert the DNS changes as the TTL is low now.


graham [12:44 PM] 
k, looks like it flushed almost straight away and not after the 24hr ttl expired :confused:


chrism [12:44 PM] 
We'd planned that we'd request them for 4:30 so had time to sort over weekend but he got confused it seems and did a bit of both requests.


[12:45] 
I'm not sure how they changed if they were `scheduled for 4:30` though.


graham [12:45 PM] 
i’ve stopped apache on the new prod server to prevent any further requests to it


emanuele [12:52 PM] 
@graham @chrism I may have missed some, however UI-DEV has issues too…


[12:53] 
10.110.0.136


graham [12:53 PM] 
we’ll look at that later, need to get live back up and running, they are reverting the dns change but i’m still seeing the new value


[12:54] 
@channel we’re going to have to copy the connection discrepancies db from the new server to the old one, i can see a bunch of audit events from andy petherbridge


[12:57] 
@fayaz can you merge master and terraform? trying to flip between branches is a nightmare right now


fayaz [12:58 PM] 
@graham - will do it, just want to be sure we don’t want to run any playbooks


graham [12:59 PM] 
also, can you add my key? i can’t seem to be able to get to the app server


chrism [12:59 PM] 
I'll sort my conflicts out after fayaz. I realised the platform didn't work how I thought it did.


graham [12:59 PM] 
@chrism guess your change for GMC isn’t live so andy couldn’t have made any changes to GMC connections?


[12:59] 
kk


chrism [1:00 PM] 
Nope it's not


graham [1:00 PM] 
ok, should be ok then, i’ll speak to him to get him to repeat that work when we’re finished


[1:02] 
@here so right now we’re waiting for the DNS to revert back to the old platform, nothing else to do until then


fayaz [1:04 PM] 
@graham: its already added, jump through new-jenkins


[1:04] 
https://hee-tis.atlassian.net/wiki/spaces/TISDEV/pages/95748197/New+VNETs


graham [1:04 PM] 
hmm, heetis@HEE-TIS-VM-JENKINS:~$ ssh heetis@10.170.0.132
Permission denied (publickey).


fayaz [1:05 PM] 
looking at it


fayaz [1:14 PM] 
@here: please hold off your commits or merges to TIS-DEVOPS


panos
[1:22 PM] 
@fayaz Can I push to dev ?


fayaz [1:22 PM] 
yes @panos


panos
[1:22 PM] 
kk thanks!


fayaz [1:22 PM] 
@here: after 4pm today please don’t push anything to any environment


[1:22] 
I will be stopping the jenkins service at 4pm


[1:22] 
and prep for the migration


chrism [1:23 PM] 
It should be back on the old DNS now