2017-09-13 Sync for Concerns and Revalidations errors

Date: 2017-09-13
Authors: Graham O'Regan (Unlicensed)
Status: Complete
Summary: The SQL sync for concerns and revalidations did not finish in live.
Impact: Users would not have seen their concerns and revalidations in the UI.

Root Cause

The problem appears to have been in the Profile service; restarting it seemed to fix the issue. We ran into several red herrings while debugging, but the Profile service seemed to be the main cause.

Trigger

The nightly run of the GMC Sync ETL.

Resolution

We tried rerunning the ETL to see if it was a transient issue, but eventually restarted the Profile service, which seemed to fix the problem.

Detection

The GMC Sync job sends an alert to Slack if it fails.
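
A minimal sketch of how a failure alert of this kind could be built and posted, assuming a Slack incoming webhook. The function names, the webhook URL, and the message wording are hypothetical; in this incident the alert came from Jenkins, and its actual configuration is not shown in this report.

```python
import json
import urllib.request


def build_alert(job_name, build_number, duration, build_url):
    """Build the JSON payload for a Slack incoming webhook (hypothetical helper)."""
    # <url|label> is Slack's link syntax for message text.
    text = f"{job_name} - #{build_number} Failure after {duration} (<{build_url}|Open>)"
    return {"text": text}


def send_alert(webhook_url, payload, timeout=10):
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With an incoming webhook the target channel is normally chosen when the webhook is created, so routing failures to #monitoring instead of #dev would likely be a configuration change rather than a code change.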

Action Items

Action Item | Type | Owner | Issue
Process data syncs in the background | mitigate | | TISDEV-2586
Move GMC Sync failures from #dev to #monitoring in Slack | mitigate | Graham O'Regan (Unlicensed) |
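
The transcript below suggests the gmc-sync client may have timed out while concerns was still working on the request. One way to read "process data syncs in the background" is to accept the request, return a job id straight away, and let the caller poll for the outcome. The sketch below illustrates that pattern only; the `SyncJobs` class and its method names are hypothetical and are not taken from the TIS services.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor


class SyncJobs:
    """Runs sync functions on a worker pool and tracks their status by job id."""

    def __init__(self, max_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._status = {}

    def start(self, sync_fn):
        """Queue sync_fn and return a job id immediately, without waiting for it."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._status[job_id] = "pending"
        self._pool.submit(self._run, job_id, sync_fn)
        return job_id

    def _run(self, job_id, sync_fn):
        with self._lock:
            self._status[job_id] = "running"
        try:
            sync_fn()
            outcome = "finished"
        except Exception as exc:
            # Record the failure so a poller (or an alert) can see it.
            outcome = f"failed: {exc}"
        with self._lock:
            self._status[job_id] = outcome

    def status(self, job_id):
        """Return the current status, or "unknown" for an unrecognised id."""
        with self._lock:
            return self._status.get(job_id, "unknown")

    def wait_all(self):
        """Block until all queued jobs are done (for shutdown or tests)."""
        self._pool.shutdown(wait=True)
```

A caller would invoke `start(...)` once and then check `status(job_id)` periodically, so no single HTTP request has to stay open for the full duration of the sync.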

Timeline

3:35 The gmc-sync-prod job (https://build.lin.nhs.uk/jenkins/job/gmc-sync-prod/40/) failed and alerted into #dev on Slack

8:24 Graham O'Regan (Unlicensed) started looking into the issue

10:12 Graham O'Regan (Unlicensed) reran the intrepid-reval-etl to update the data https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/324/

Supporting Information

Slack transcript


jenkins APP [3:35 AM]
----------------
gmc-sync-prod - #40 Failure after 4 min 43 sec (Open)

1 reply at 8:24 AM


graham [9:55 AM]
@channel still getting timeouts on the http://concerns:8084/api/sync-data call which is causing issues on live


[9:56]
anyone recognise the issue? (edited)


srochani [9:57 AM]
@graham is it because of keycloak upgrade?


graham [9:57 AM]
gmc-sync is calling concerns to sync the data which, in turn, makes 3 calls, audit start, sync() then audit finish. i can see the last audit update


[9:57]
@srochani no, prod hasn’t been updated


[9:57]
and the services aren’t using it internally yet


srochani [9:58 AM]
ok will investigate


graham [9:58 AM]
@fayaz anything obvious on the system side?


fayaz [9:59 AM]
nope, logging on prod,


graham [9:59 AM]
i’ve restarted the concerns and profile services, i *think* this might be coming from elasticsearch


fayaz [10:00 AM]
when did you restarted, 45mins back?


graham [10:03 AM]
yup


fayaz [10:04 AM]
added and commented on this Plain Text snippet
heetis@HEE-TIS-UBUNTU-API-GATEWAY-PROD:~$ curl -XGET 'localhost:9200/concerns?v&pretty'
{
  "concerns" : {
    "aliases" : { },
    "mappings" : {
      "concern" : {
        "properties" : {
          "closedBy" : {
            "type" : "string"
          },
          "concernDetails" : {
            "type" : "string"
          },
          "contactPerson" : {
            "type" : "string"
          },
          "createdDateTime" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "dateClosed" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "dateReportedToHee" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "designatedBodyCode" : {
            "type" : "string",
            "index" : "not_analyzed"
          },
          "employer" : {
            "type" : "string"
          },
          "firstName" : {
            "type" : "string"
          },
          "gmcNumber" : {
            "type" : "string"
          },
          "gradeAtTimeOfIncident" : {
            "type" : "string"
          },
          "id" : {
            "type" : "long"
          },
          "incidentDate" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "incidentTypeCode" : {
            "type" : "string"
          },
          "lastName" : {
            "type" : "string"
          },
          "locationCode" : {
            "type" : "string"
          },
          "sourceCode" : {
            "type" : "string"
          },
          "statusCode" : {
            "type" : "string"
          },
          "tisId" : {
            "type" : "string"
          },
          "userId" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "max_result_window" : "1000000",
        "creation_date" : "1494948900594",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "6v1o3gehTPCvWvWQCSV_mw",
        "version" : {
          "created" : "2040299"
        }
      }
    },
    "warmers" : { }
  }
}
do you see anything out of norm


graham [10:05 AM]
that is just the mapping, browsing the indexes i can see that there is data there but it looks sparsely populated


[10:06]
@apringle can you login to live and take a look at the data?


[10:06]
it looks like the programme info is empty


fayaz [10:06 AM]
uploaded this image: Pasted image at 2017-09-13, 10:06 AM


graham [10:08 AM]
i think the problem might be that the client in gmc-sync is call concerns, concerns is completing but takes too long so the client stops and throws an exception even tho the original request is still running


[10:09]
so concerns might be a distraction, the data on the under notice tab seems to be missing info for several columns tho


fayaz [10:10 AM]
are you seeing errors in the log


graham [10:13 AM]
i’m going to rerun the reval-etl to see if it repopulates the index correctly https://build.tis.nhs.uk/jenkins/view/Intrepid/job/intrepid-reval-etl-all-prod/324/


graham [10:50 AM]
@here that is repopulating the indexes correctly, still a few mins left to go on it