2023-08-14 Sync Service NI

Date

Aug 14, 2023

Authors

@Joseph (Pepe) Kelly @Yafang Deng @Jayanta Saha

Status

Documenting

Summary

The Person sync job failed with the exception [Unable to parse response body; nested exception is ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host

Impact

No users were impacted. The person search list was incomplete between 2:29am and 7:36am (about 5 hours).

Non-technical Description

An alert from production monitoring reported that the Sync job had stalled in the early morning of 14 August, causing a timeout in the bulk upload. The first TIS team member to see the alert logged on to TIS and re-ran the job.

 

Trigger


Detection

  • A failed job status was noticed in the #monitoring-prod Slack channel.


Resolution

  • The Person Sync job was manually restarted and completed successfully.


Timeline

BST unless otherwise stated

  • Aug 14, 2023 2:29am - Person Elasticsearch sync failed.

  • Aug 14, 2023 7:36am - Rerun of Person Elasticsearch sync completed.

Root Cause(s)

Why did the sync fail? It failed because of an HTTP 429 (Too Many Requests) error.

Why did we get that error? The OpenSearch instance could not handle this request alongside the others it had received.

Potential reasons:

  • Data node instance types and search or write limits

  • High values for instance metrics

  • Active and Queue threads

  • High CPU utilization and JVM memory pressure
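Since an HTTP 429 usually signals that the cluster is temporarily overloaded, one common mitigation is to retry the rejected request after a growing delay rather than failing the whole job. The following is a minimal sketch of that idea, not the Sync service's actual code: `TooManyRequests`, `with_backoff` and `flaky_bulk_upload` are hypothetical names introduced for illustration, and the attempt count and delays are assumed values.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the search cluster."""


def with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff while it raises TooManyRequests."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TooManyRequests:
            if attempt == max_attempts:
                raise  # give up so the failure is still visible to monitoring
            # The delay doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter avoids many clients retrying at the same moment.
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a fake bulk upload that is rejected twice, then succeeds.
calls = {"n": 0}


def flaky_bulk_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TooManyRequests("429 Too Many Requests")
    return "ok"


result = with_backoff(flaky_bulk_upload, sleep=lambda seconds: None)
```

Retries only help with short bursts of load. If the cluster is persistently under pressure, the potential reasons listed above still need to be investigated.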

 

 

 


Action Items

Action Items

Owner

 


Lessons Learned

  • We learnt about the potential causes of the Sync failure.

  • Re-trigger the sync job promptly, before our users become aware of the failure.

  • We need a more inclusive monitoring approach.

  • Explore ways to expand the keyword lexicon used in monitoring to improve error coverage.
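As an illustration of what keyword lexicon monitoring could look like, the sketch below scans a log line for known error phrases. It is an assumption about the approach, not a description of the existing monitoring: `ERROR_LEXICON` and `matched_keywords` are hypothetical names, and only the "429" and "unable to parse response body" entries come from this incident; the other keywords are examples.

```python
import re

# Hypothetical lexicon of error phrases to watch for in job logs.
ERROR_LEXICON = [
    "429",
    "too many requests",
    "unable to parse response body",
    "timeout",
]

# One case-insensitive pattern; word boundaries avoid matching "429" inside longer numbers.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(keyword) for keyword in ERROR_LEXICON) + r")\b",
    re.IGNORECASE,
)


def matched_keywords(log_line):
    """Return the sorted, lower-cased lexicon entries found in log_line."""
    return sorted({match.group(0).lower() for match in _PATTERN.finditer(log_line)})


# Usage: the start of the exception message from this incident.
hits = matched_keywords("Person sync job failed with exception [Unable to parse response body")
```

Extending coverage then means adding phrases to the list, which keeps the lexicon easy to review after each incident.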