Date

04 Feb 2021

Authors

Reuben Roberts, Joseph (Pepe) Kelly, Marcello Fabbri (Unlicensed)

Status

Resolved

Summary

Duplicate Trainees

PersonElasticSearchSyncJob ran simultaneously on both blue and green production servers (HEE-TIS-VM-PROD-APPS-BLUE and HEE-TIS-VM-PROD-APPS-GREEN). This caused duplicate Person entries in Elasticsearch.

Manually rerunning the PersonElasticSearchSyncJob resolved the issue.

Impact

Users observed duplicate entries in the list of people.

Table of Contents

  • Non-technical Description

  • Trigger

  • Detection

  • Resolution

  • Timeline

  • Root Cause(s)

  • Action Items

  • Lessons Learned

Non-technical Description

An edge-case sequence of events meant that the overnight synchronisation of 'Person' data ran on both load-balanced servers, creating duplicates. A locking mechanism is meant to prevent this by ensuring the jobs run on only one of the servers. The team are investigating what contributed to the edge case in order to stop it recurring, and are strengthening the logic governing the locking mechanism.
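
The lock implementation itself isn't included in this report. As a minimal sketch of the idea, assuming a shared database table job_lock(job_name, locked_until) pre-seeded with one row per job (the class, table, and column names here are all illustrative, not the actual TIS code):

Code Block
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Minimal sketch of a cross-server job lock backed by a shared database.
// Both servers race to UPDATE the same row; the WHERE clause lets exactly
// one of them win.
public class DatabaseJobLock {

    private final Connection connection;

    public DatabaseJobLock(Connection connection) {
        this.connection = connection;
    }

    public boolean tryAcquire(String jobName, int holdMinutes) throws SQLException {
        Instant now = Instant.now();
        Instant until = now.plus(holdMinutes, ChronoUnit.MINUTES);
        String sql = "UPDATE job_lock SET locked_until = ? "
                + "WHERE job_name = ? AND locked_until < ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.from(until));
            ps.setString(2, jobName);
            ps.setTimestamp(3, Timestamp.from(now));
            return ps.executeUpdate() == 1; // true only on the winning server
        }
    }

    public void release(String jobName) throws SQLException {
        // Expire the lock immediately so the next run can acquire it.
        String sql = "UPDATE job_lock SET locked_until = ? WHERE job_name = ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setTimestamp(1, Timestamp.from(Instant.now()));
            ps.setString(2, jobName);
            ps.executeUpdate();
        }
    }
}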


Trigger

  • Teams notification [screenshot]

  • Slack notification [screenshot]


Detection

  • A user reported the issue in the Teams Support Channel; it was also raised in the tis-dev-team Slack channel.

  • The overlapping jobs could be seen in the server logs [screenshot] and in the monitoring-prod Slack channel (runs started at 01:29 and 01:33) [screenshot].


Resolution

  • Manually rerunning the PersonElasticSearchSyncJob removed the duplicate entries (at 09:25, see Timeline).

Timeline

  • 00:21 - Out of Memory Error on HEE-TIS-VM-PROD-APPS-BLUE

  • 01:29 - PersonElasticSearchSyncJob : Sync [Person sync job] started on HEE-TIS-VM-PROD-APPS-GREEN

  • 01:33 - PersonElasticSearchSyncJob : Sync [Person sync job] started on HEE-TIS-VM-PROD-APPS-BLUE

  • 08:21 - Notification on Teams

  • 09:25 - Job manually run again


Root Cause(s)

  • The job ran in parallel, one instance on each of the servers.

  • The ‘locking’ that should prevent this only takes account of scheduled runs (illustrated in a sketch below the log excerpt).

  • The container restarted within the ~10 minute window in which this problem can occur.

  • The restart was triggered by an OutOfMemoryError:

    Code Block
    2021-02-04 00:21:19.677  INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [49403] and lastEmployingBodyId: [287]
    2021-02-04 00:21:27.760  INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate    : Time taken to read chunk : [12.95 s]
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to /var/log/apps/hprof/sync-2021-01-19-11:29:40.hprof ...
    Unable to create /var/log/apps/hprof/sync-2021-01-19-11:29:40.hprof: File exists
    Terminating due to java.lang.OutOfMemoryError: Java heap space
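
Note from the log that the heap dump itself failed: the configured dump file name (apparently fixed at an earlier startup, hence the 2021-01-19 timestamp) already existed, so no usable dump was written. As a hedged aside, since the sync container's actual JVM options aren't shown in this report, pointing -XX:HeapDumpPath at a directory avoids the collision because the JVM then writes a unique java_pid<pid>.hprof per dump:

Code Block
# Illustrative JVM options only, not the service's actual configuration;
# the jar name is hypothetical.
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/log/apps/hprof \
     -jar sync.jar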

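To make the locking gap concrete, here is an illustrative sketch (not the actual TIS code; the cron value, lock interface, and method names are assumptions) of a lock that guards only the @Scheduled entry point:

Code Block
import org.springframework.scheduling.annotation.Scheduled;

// Illustrative sketch only: the lock is taken at the @Scheduled entry point,
// so a manual trigger or a post-restart re-run bypasses it and can overlap
// with the scheduled run on the other server.
public class PersonElasticSearchSyncJob {

    private final JobLock lock; // hypothetical cross-server lock

    public PersonElasticSearchSyncJob(JobLock lock) {
        this.lock = lock;
    }

    @Scheduled(cron = "0 29 1 * * *") // assumed overnight schedule
    public void scheduledRun() {
        if (lock.tryAcquire("personSync")) { // only this path checks the lock
            try {
                sync();
            } finally {
                lock.release("personSync");
            }
        }
    }

    // Manual / restart-recovery entry point: no lock check, which is the
    // gap behind this incident.
    public void runManually() {
        sync();
    }

    private void sync() {
        // rebuild the Person index in Elasticsearch
    }

    interface JobLock {
        boolean tryAcquire(String jobName);
        void release(String jobName);
    }
}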

Action Items

  • Investigate and resolve Out of Memory errors (Owner: Reuben Roberts; after a review of previous incidents?)

  • Improve locking mechanism to make it more robust, i.e. locking that includes runs that aren’t part of the @Scheduled configuration (Owner: Marcello Fabbri (Unlicensed))


Lessons Learned

  • The ‘locking’ that prevents the job running in parallel only takes account of scheduled runs. A container restart or a manual run can cause duplication if it overlaps with the job running on the other server instance.

  • We need a more robust solution for preventing duplicate runs of the same job, e.g. locking that covers runs outside the @Scheduled configuration (a sketch follows below).
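
As a hedged sketch of that more robust approach (names carried over from the earlier illustrative sketches, not the actual fix): acquiring the lock inside the job body itself, rather than at the @Scheduled entry point, covers scheduled, manual, and post-restart runs alike.

Code Block
// Illustrative sketch only: every trigger funnels through run(), so every
// entry point must win the cross-server lock before syncing.
public class PersonElasticSearchSyncJob {

    private final JobLock lock; // hypothetical cross-server lock

    public PersonElasticSearchSyncJob(JobLock lock) {
        this.lock = lock;
    }

    public void run() { // single entry point used by all triggers
        if (!lock.tryAcquire("personSync")) {
            return; // the other server is already syncing; skip quietly
        }
        try {
            sync();
        } finally {
            lock.release("personSync");
        }
    }

    private void sync() {
        // rebuild the Person index in Elasticsearch
    }

    interface JobLock {
        boolean tryAcquire(String jobName);
        void release(String jobName);
    }
}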