Date	Feb 4, 2021
Authors	@Reuben Roberts @Joseph (Pepe) Kelly @Marcello Fabbri (Unlicensed)
Status	Resolved
Summary	`PersonElasticSearchSyncJob` ran simultaneously on both blue and green production servers (HEE-TIS-VM-PROD-APPS-BLUE and HEE-TIS-VM-PROD-APPS-GREEN). This caused duplicate Person entries in Elasticsearch. Manually rerunning the `PersonElasticSearchSyncJob` resolved the issue.
Impact	Users observed duplicate entries in the list of people.

Non-technical Description

An edge-case sequence of events meant that the synchronisation of 'Person' data ran on both load-balanced servers during the overnight jobs, creating duplicates. A locking mechanism is meant to prevent this by ensuring the jobs run on only one of the servers. The team are investigating what contributed to the edge case, in order to stop it recurring, and are strengthening the logic that governs the locking mechanism.

Trigger

Teams notification
Slack notification

Detection

A user reported the issue in the Teams Support Channel. It was also raised in the TIS tis-dev-team Slack channel.
The overlapping jobs could be seen in the server logs and in the monitoring-prod Slack channel (started 1:29 AM and 1:33 AM).

Resolution

TIS Team manually re-ran the Person Sync job from the Sync administration panel: https://apps.tis.nhs.uk/sync/

Timeline

Feb 4, 2021 00:21 - Out of Memory Error on HEE-TIS-VM-PROD-APPS-BLUE
Feb 4, 2021 01:29 - PersonElasticSearchSyncJob : Sync [Person sync job] started on HEE-TIS-VM-PROD-APPS-GREEN
Feb 4, 2021 01:33 - PersonElasticSearchSyncJob : Sync [Person sync job] started on HEE-TIS-VM-PROD-APPS-BLUE
Feb 4, 2021 08:21 - Notification on Teams
Feb 4, 2021 09:25 - Job run again

Root Cause(s)

The job ran in parallel, once on each of the servers.
The ‘locking’ meant to prevent this only takes account of scheduled runs.
The container restarted within the ~10 minute window in which this problem can occur.
This was triggered by an OutOfMemoryError:
2021-02-04 00:21:19.677 INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [49403] and lastEmployingBodyId: [287]
2021-02-04 00:21:27.760 INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate : Time taken to read chunk : [12.95 s]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/log/apps/hprof/sync-2021-01-19-11:29:40.hprof ...
Unable to create /var/log/apps/hprof/sync-2021-01-19-11:29:40.hprof: File exists
Terminating due to java.lang.OutOfMemoryError: Java heap space

Action Items

Action Items	Owner

Investigate and resolve Out of Memory errors

@Reuben Roberts

After a review of previous incidents?

Improve locking mechanism to make it more robust, i.e. locking that includes runs that aren’t part of the @Scheduled configuration

@Marcello Fabbri (Unlicensed)

Lessons Learned

The ‘locking’ that prevents the job running in parallel only takes account of scheduled runs. A container restart or a manual run of the job can cause duplication if it overlaps with the job running on the other server instance.
We need a more robust way to prevent the same job running on both servers at once.
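
One possible shape for a more robust approach is to take the lock inside the job itself rather than in the scheduling layer, so that scheduled runs, manual runs and runs triggered by a container restart all pass through the same check. The following is a minimal sketch under stated assumptions, not the TIS implementation: `SharedLockStore` and `GuardedSyncJob` are hypothetical names, and the in-memory map only stands in for storage that both servers can reach (a database table, for example). As written it would not coordinate two separate servers. The lease expiry is there so that a lock held by a container that dies, for instance from an OutOfMemoryError, does not block the job forever.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a lock store shared by both servers.
// Assumption: in production this would be backed by shared storage,
// not by memory local to one JVM.
class SharedLockStore {
  private static final class Lease {
    final String owner;
    final Instant expiresAt;

    Lease(String owner, Instant expiresAt) {
      this.owner = owner;
      this.expiresAt = expiresAt;
    }
  }

  private final ConcurrentMap<String, Lease> leases = new ConcurrentHashMap<>();

  // Atomically take the lock if it is free or its lease has expired.
  boolean tryAcquire(String jobName, String owner, Instant now, Duration ttl) {
    Lease wanted = new Lease(owner, now.plus(ttl));
    Lease held = leases.compute(jobName, (name, current) ->
        (current == null || !current.expiresAt.isAfter(now)) ? wanted : current);
    return held == wanted;
  }

  // Release the lock only if this owner is the one holding it.
  void release(String jobName, String owner) {
    leases.computeIfPresent(jobName, (name, current) ->
        current.owner.equals(owner) ? null : current);
  }
}

// Hypothetical wrapper: every entry point (scheduled, manual, start-up)
// calls runIfNotAlreadyRunning, so the trigger type cannot bypass the lock.
class GuardedSyncJob {
  private final SharedLockStore locks;
  private final String serverName;

  GuardedSyncJob(SharedLockStore locks, String serverName) {
    this.locks = locks;
    this.serverName = serverName;
  }

  // Returns true if the work ran, false if another holder had the lock.
  boolean runIfNotAlreadyRunning(String jobName, Runnable work, Instant now, Duration ttl) {
    if (!locks.tryAcquire(jobName, serverName, now, ttl)) {
      return false;
    }
    try {
      work.run();
      return true;
    } finally {
      locks.release(jobName, serverName);
    }
  }
}
```

With this arrangement, a second server attempting the Person sync four minutes after the first would be refused for as long as the first server's lease is held and unexpired, regardless of what triggered the second run. The lease duration is a design choice: it needs to be longer than the job's longest expected run, otherwise the lock can expire while the job is still in progress.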


2021-02-04 Duplicate Trainees
