Date

2021-02-04

Authors

Reuben Roberts

Status

Resolved

Summary

PersonElasticSearchSyncJob ran simultaneously on both blue and green production servers (HEE-TIS-VM-PROD-APPS-BLUE and HEE-TIS-VM-PROD-APPS-GREEN). This caused duplicate Person entries in Elasticsearch.

Manually rerunning the PersonElasticSearchSyncJob resolved the issue.

Impact

 Users observed duplicate entries in the list of people.

Non-technical Description


Trigger

  • Teams notification

  • Slack notification


Detection

  • A user reported the issue in the Teams support channel. It was also raised in the tis-dev-team Slack channel.

  • The overlapping jobs were visible in the server logs and in the monitoring-prod Slack channel (started at 01:29 and 01:33).


Resolution

  • Manually rerunning the PersonElasticSearchSyncJob (09:25) resolved the issue.


Timeline

  • 00:21 - Out of Memory Error on HEE-TIS-VM-PROD-APPS-BLUE

  • 01:29 - PersonElasticSearchSyncJob : Sync [Person sync job] started on HEE-TIS-VM-PROD-APPS-GREEN

  • 01:33 - PersonElasticSearchSyncJob : Sync [Person sync job] started on HEE-TIS-VM-PROD-APPS-BLUE

  • 08:21 - Notification on Teams

  • 09:25 - PersonElasticSearchSyncJob rerun manually


Root Cause(s)

  • The job ran in parallel, once on each of the servers.

  • The ‘locking’ meant to prevent this only takes account of scheduled runs.

  • A container restarted within the ~10 minute window in which this problem can occur.

  • The restart was triggered by an OutOfMemoryError:

    2021-02-04 00:21:19.677  INFO 1 --- [onPool-worker-2] s.j.PersonPlacementEmployingBodyTrustJob : Querying with lastPersonId: [49403] and lastEmployingBodyId: [287]
    2021-02-04 00:21:27.760  INFO 1 --- [onPool-worker-2] u.n.t.s.job.TrustAdminSyncJobTemplate    : Time taken to read chunk : [12.95 s]
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to /var/log/apps/hprof/sync-2021-01-19-11:29:40.hprof ...
    Unable to create /var/log/apps/hprof/sync-2021-01-19-11:29:40.hprof: File exists
    Terminating due to java.lang.OutOfMemoryError: Java heap space
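
The root cause above is that the lock only covers scheduled runs. One possible direction, sketched below, is a lease that every entry point (scheduled run, manual trigger, run after a container restart) must acquire before starting the job. This is a minimal illustration, not the existing TIS implementation: the class names `JobLeaseLock` and `GuardedJobRunner` are hypothetical, and the in-memory map only stands in for a store that both servers would have to share (for example a database table). The lease has an expiry so that a container killed mid-run does not hold the lock indefinitely.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical lease-based lock. A real deployment needs a store shared by both servers. */
final class JobLeaseLock {
    private static final class Lease {
        final String owner;
        final Instant expiresAt;

        Lease(String owner, Instant expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    // Assumption: stands in for shared storage visible to both BLUE and GREEN.
    private final ConcurrentMap<String, Lease> leases = new ConcurrentHashMap<>();

    /** Takes the lease for jobName if it is free or expired; returns true on success. */
    boolean tryAcquire(String jobName, String owner, Instant now, Duration ttl) {
        boolean[] acquired = {false};
        // compute() is atomic per key, so two callers cannot both see the lease as free.
        leases.compute(jobName, (name, current) -> {
            boolean free = current == null || !current.expiresAt.isAfter(now);
            if (free) {
                acquired[0] = true;
                return new Lease(owner, now.plus(ttl));
            }
            return current;
        });
        return acquired[0];
    }

    /** Releases the lease, but only if it is still held by the given owner. */
    void release(String jobName, String owner) {
        leases.computeIfPresent(jobName,
                (name, current) -> current.owner.equals(owner) ? null : current);
    }
}

/** Hypothetical wrapper that every trigger path would call instead of running the job directly. */
final class GuardedJobRunner {
    /** Runs the job only if the lease is obtained; returns false if the run was skipped. */
    static boolean runIfFree(JobLeaseLock lock, String jobName, String owner,
                             Duration ttl, Runnable job) {
        if (!lock.tryAcquire(jobName, owner, Instant.now(), ttl)) {
            return false; // another instance is already running this job
        }
        try {
            job.run();
            return true;
        } finally {
            lock.release(jobName, owner);
        }
    }
}
```

With a guard of this kind, the 01:33 start on HEE-TIS-VM-PROD-APPS-BLUE would have been skipped while the 01:29 run on HEE-TIS-VM-PROD-APPS-GREEN still held the lease. The lease duration would need to be longer than the job's normal run time, otherwise a slow run could lose the lease before it finishes.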

Action Items

Action Items

Owner

  • Investigate and resolve Out of Memory errors

  • Improve locking mechanism to make it more robust

Reuben Roberts

Marcello Fabbri (Unlicensed)


Lessons Learned

  • The ‘locking’ that prevents the job from running in parallel only takes account of scheduled runs. A container restart or a manual run can cause duplicates if it overlaps with the job running on the other server instance.
