Date	2017-09-13
Authors	Graham O'Regan (Unlicensed)
Status	Complete
Summary	The SQL sync for concerns and revalidations did not finish in live
Impact	The users would not have seen their concerns and revalidations in the UI

Root Cause

The problem appears to have been with the Profile service: restarting it seemed to fix the problem. We came across some red herrings while debugging, but the Profile service seemed to be the main issue.

Trigger

The nightly run of the GMC Sync ETL

Resolution

We tried rerunning the ETL to see if it was a transient issue, but eventually restarted the Profile service, which seemed to fix the problem.

Detection

The GMC Sync job sends an alert to Slack if it fails.
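As an illustration of this kind of failure alert, here is a minimal sketch that builds a message in the same shape as the one in the Slack transcript and posts it to a Slack incoming webhook. It is not how the Jenkins job actually sends its alert; the function names and the webhook URL are assumptions made for the example. With incoming webhooks the target channel is set by the webhook's configuration, so routing alerts to a different channel would mean using a webhook configured for that channel.

```python
import json
import urllib.request


def build_failure_alert(job_name, build_number, duration, build_url):
    """Build a Slack message payload describing a failed job run."""
    text = f"{job_name} - #{build_number} Failure after {duration} ({build_url})"
    return {"text": text}


def send_alert(webhook_url, payload, timeout=10):
    """POST the payload as JSON to a Slack incoming webhook.

    webhook_url is hypothetical here; the channel it posts to is
    determined by how the webhook was configured in Slack.
    """
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping the payload builder separate from the sender means the message format can be checked without making a network call.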

Action Items

Action Item	Type	Owner	Issue
Process data syncs in the background	mitigate		TISDEV-2586
Move GMC Sync failures from #dev to #monitoring in Slack	mitigate	Graham O'Regan (Unlicensed)
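
The first action item responds to a failure mode suggested in the Slack transcript: the client may give up waiting on a long-running sync call even though the sync itself is still running. The sketch below shows one possible shape for processing a sync in the background: the caller gets a job id back immediately and polls for the outcome instead of holding a request open. It is only an illustration under stated assumptions; `start_sync`, `get_status` and the in-memory job registry are hypothetical names, not part of the existing services.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory job registry; a real service would persist this.
_executor = ThreadPoolExecutor(max_workers=1)
_jobs = {}
_lock = threading.Lock()


def _run(job_id, sync_fn):
    """Run the sync and record the outcome so callers can poll for it."""
    try:
        result = sync_fn()
        outcome = {"status": "finished", "result": result}
    except Exception as exc:  # record the failure instead of losing it
        outcome = {"status": "failed", "error": str(exc)}
    with _lock:
        _jobs[job_id] = outcome


def start_sync(sync_fn):
    """Queue the sync and return a job id without waiting for it to finish."""
    job_id = str(uuid.uuid4())
    with _lock:
        _jobs[job_id] = {"status": "running"}
    _executor.submit(_run, job_id, sync_fn)
    return job_id


def get_status(job_id):
    """Return a copy of the job's current state, or 'unknown' if not found."""
    with _lock:
        return dict(_jobs.get(job_id, {"status": "unknown"}))
```

With this shape, a caller such as the nightly job would call `start_sync` once and then check `get_status` at intervals, so a slow sync shows up as a job that is still running rather than as a client-side timeout.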

Timeline

3:35 The https://build.lin.nhs.uk/jenkins/job/gmc-sync-prod/40/ job failed and alerted into #dev on Slack

8:24 Graham O'Regan (Unlicensed) started looking into the issue

10:12 Graham O'Regan (Unlicensed) reran the intrepid-reval-etl to update the data https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/324/

Supporting Information

Slack transcript

jenkins APP [3:35 AM]
----------------
gmc-sync-prod - #40 Failure after 4 min 43 sec (Open)

1 reply Today at 8:24 AM View thread

graham [9:55 AM]
@channel still getting timeouts on the http://concerns:8084/api/sync-data call which is causing issues on live

[9:56]
anyone recognise the issue?

srochani [9:57 AM]
@graham is it because of keycloak upgrade?

graham [9:57 AM]
gmc-sync is calling concerns to sync the data which, in turn, makes 3 calls, audit start, sync() then audit finish. i can see the last audit update

[9:57]
@srochani no, prod hasn’t been updated

[9:57]
and the services aren’t using it internally yet

srochani [9:58 AM]
ok will investigate

graham [9:58 AM]
@fayaz anything obvious on the system side?

fayaz [9:59 AM]
nope, logging on prod,

graham [9:59 AM]
i’ve restarted the concerns and profile services, i *think* this might be coming from elasticsearch

fayaz [10:00 AM]
when did you restarted, 45mins back?

graham [10:03 AM]
yup

fayaz [10:04 AM]
added and commented on this Plain Text snippet
heetis@HEE-TIS-UBUNTU-API-GATEWAY-PROD:~$ curl -XGET 'localhost:9200/concerns?v&pretty'
{
  "concerns" : {
    "aliases" : { },
    "mappings" : {
      "concern" : {
        "properties" : {
          "closedBy" : {
            "type" : "string"
          },
          "concernDetails" : {
            "type" : "string"
          },
          "contactPerson" : {
            "type" : "string"
          },
          "createdDateTime" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "dateClosed" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "dateReportedToHee" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "designatedBodyCode" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "employer" : {
            "type" : "string"
          },
          "firstName" : {
            "type" : "string"
          },
          "gmcNumber" : {
            "type" : "string"
          },
          "gradeAtTimeOfIncident" : {
            "type" : "string"
          },
          "id" : {
            "type" : "long"
          },
          "incidentDate" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "incidentTypeCode" : {
            "type" : "string"
          },
          "lastName" : {
            "type" : "string"
          },
          "locationCode" : {
            "type" : "string"
          },
          "sourceCode" : {
            "type" : "string"
          },
          "statusCode" : {
            "type" : "string"
          },
          "tisId" : {
            "type" : "string"
          },
          "userId" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "max_result_window" : "1000000",
        "creation_date" : "1494948900594",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "6v1o3gehTPCvWvWQCSV_mw",
        "version" : {
          "created" : "2040299"
        }
      }
    },
    "warmers" : { }
  }
}
1 Comment Collapse
do you see anything out of norm

graham [10:05 AM]
that is just the mapping, browsing the indexes i can see that there is data there but it looks sparsely populated

[10:06]
@apringle can you login to live and take a look at the data?

[10:06]
it looks like the programme info is empty

fayaz [10:06 AM]
uploaded this image: Pasted image at 2017-09-13, 10:06 AM

graham [10:08 AM]
i think the problem might be that the client in gmc-sync is call concerns, concerns is completing but takes too long so the client stops and throws an exception even tho the original request is still running

[10:09]
so concerns might be a distraction, the data on the under notice tab seems to be missing info for several columns tho

fayaz [10:10 AM]
are you seeing errors in the log

graham [10:13 AM]
i’m going to rerun the reval-etl to see if it repopulates the index correctly https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/324/

graham [10:50 AM]
@here that is repopulating the indexes correctly, still a few mins left to go on it