Date |
|
Authors | |
Status | Documenting |
Summary | Person sync job failed with exception [Unable to parse response body; nested exception is ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host |
Impact | No users were impacted; the person search list was incomplete between 2:29am and 7:36am |
Table of Contents |
---|
Non-technical Description
A notification alert from prod monitoring reported that the sync job had stalled in the early morning of 14 August, causing a timeout in the bulk upload. The first TIS team member to see the alert logged on to TIS and re-ran the job.
Trigger
...
Detection
A failed job status was noticed in the #monitoring-prod Slack channel.
...
BST unless otherwise stated
02:29am - Person elastic search sync failed
02:39am - Sync [RevalCurrentPmSyncJob] started
07:36am - Sync [RevalCurrentPmSyncJob] manually rerun
07:36am - Rerun of Sync [RevalCurrentPmSyncJob] Person elastic search sync completed
Root Cause(s)
The sync failed. Why did it fail? The request was rejected with an HTTP 429 (Too Many Requests) error.
Why did we get that error? The OpenSearch instance could not handle this request on top of the others it had already received.
Potential reasons:
Data node instance types and search or write limits
High values for instance metrics
Active and Queue threads
High CPU utilization and JVM memory pressure
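An HTTP 429 response signals that the server is throttling requests, so one common client-side mitigation is to retry with exponential backoff. The incident notes do not say this was adopted. The sketch below is illustrative only: `send_with_backoff`, its parameters, and the `send` callable are hypothetical names, and the code is not taken from the TIS sync service.

```python
import random
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 429, or attempts run out.

    `send` is a hypothetical callable that performs one bulk request and
    returns its HTTP status code. The last status code seen is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    status = 429
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: the upper bound doubles
            # on each retry and is capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, upper))
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. Backoff only helps if the throttling is transient; sustained CPU or JVM memory pressure on the cluster would still need to be addressed at the instance level.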
...
Action Items
Action Items | Owner | |
---|---|---|
...
Lessons Learned
We learnt the potential causes of the sync failure.
Re-trigger the sync job promptly, before our users become aware of the failure.
We need a more inclusive monitoring approach.
Explore ways to expand the keyword lexicon our monitoring uses, to improve error coverage.
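As a rough illustration of keyword-based error coverage, the sketch below checks a log line against a small list of patterns. The first three patterns come from the exception text in this report; the last two are assumed additions for HTTP 429 responses. `ERROR_PATTERNS` and `matched_patterns` are hypothetical names, not part of the existing monitoring setup.

```python
import re
from typing import List

# Patterns to flag in log lines. The first three appear in this incident's
# exception message; the 429 patterns are assumptions added for illustration.
ERROR_PATTERNS = [
    r"Unable to parse response body",
    r"ElasticsearchStatusException",
    r"ResponseException",
    r"\b429\b",
    r"Too Many Requests",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in ERROR_PATTERNS]


def matched_patterns(log_line: str) -> List[str]:
    """Return the patterns (in list order) that occur in the log line."""
    return [p.pattern for p in _COMPILED if p.search(log_line)]
```

A line that matches no pattern returns an empty list, so reviewing unmatched failure logs after each incident is one way to find gaps in the lexicon.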