2016-12-21 Intrepid ETL Failure
Date | 2016-12-21 |
Authors | Grante Marshall (Unlicensed), Graham O'Regan (Unlicensed) |
Status | Complete |
Summary | The Intrepid ETL process ran on production but failed due to an error in the SQL query. The Elasticsearch snapshot restore also failed, which left us unable to update the index for the day. We checked with Joanne Watson (Unlicensed), the service manager, to see whether it would impact the pilot; it wasn't being used, so we didn't request access from Hicom to the DR to resolve the issue. |
Impact | The service was not usable for the day. |
Root Cause
A SQL query was referencing the test DR schema. Once we detected that the process had failed, we checked the configuration of the Docker containers and quickly realised that a Docker image had been updated on production, so pre-production code had been released.
Trigger
The nightly Intrepid ETL ran and failed.
Resolution
Pinned the versions of the containers in our configuration.
Detection
After the issues with Hicom's DR run, the team checked the service the following morning. Alex Dobre (Unlicensed) discovered the issue by looking at the container log files.
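The check itself was a look at the container logs on the host. A minimal sketch of that kind of check is below; the container name intrepid-etl is illustrative, not the real one:

```
# List containers that have exited, then scan the ETL container's logs for errors.
# NOTE: "intrepid-etl" is an illustrative container name, not necessarily the real one.
docker ps -a --filter "status=exited"
docker logs --tail 200 intrepid-etl 2>&1 | grep -iE "error|exception"
```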
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Create single config file for container versions | prevent | Graham O'Regan (Unlicensed) | |
Pin versions of containers in stage and prod | prevent | Graham O'Regan (Unlicensed) | |
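As a rough sketch of the two prevention items above (illustrative image names and tags only, not the actual deployment configuration), the idea is a single versions file that both stage and prod deployments read, so a rebuilt image can never drift onto production unnoticed:

```
# versions.env - single source of truth for container tags
# (illustrative image names and tag values)
INTREPID_ETL_TAG=1.4.2
GMC_SYNC_TAG=2.0.1

# deploy snippet (sketch only) - pull and run the pinned tag instead of :latest
. ./versions.env
docker pull hee/intrepid-etl:"${INTREPID_ETL_TAG}"
docker rm -f intrepid-etl 2>/dev/null || true
docker run -d --name intrepid-etl hee/intrepid-etl:"${INTREPID_ETL_TAG}"
```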
Timeline
- The etl-prod job ran at 3am
- The team checked the service at 9am
- The version of the container for the Intrepid ETL was pinned by Graham O'Regan (Unlicensed) at 9pm ahead of the next morning's run.
Supporting Information
Copying conversation between Grante and Naveen:

grantemarshall [9:23 AM] can we carry out a restore based on the snapshot that is in blob storage?
naveen [9:25 AM] Restore will happen only if the etl throws an error while saving or updating data.
[9:25] In our case we had failed even to connect to INTREPID data
[9:26] so restore won't be triggered if etl fails to connect to INTREPID schema
grantemarshall [9:26 AM] So an edge case that's not been catered for yet. Can it be kicked off manually?
naveen [9:27 AM] imagine this case:
[9:27] we go for restore only if the elastic search index is deleted and latest data failed to update
[9:28] that happens only in gmc-sync
[9:28] currently if tis-new-core fails, it won't do the restore
[9:28] when we merge both gmc-sync and core into one container in future, this edge case will go away
[9:28] there is a backlog ticket to merge these 2 into one container
[9:29] so for now, if we rerun the etl on LIVE, it should fetch
grantemarshall [9:29 AM] so the etl fail will kick off the restore? do you want to bring this up in standup?
naveen [9:29 AM] yes
[9:30] For restoring url: POST /_snapshot/my_backup/snapshot_1/_restore
[9:30] you can do that manually
[9:31] my_backup is elasticsearch, snapshot_1 is snapshot name
naveen [9:37 AM] to restore specific indices:
[9:37] >>>POST /_snapshot/my_backup/snapshot_1/_restore { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": true, "rename_pattern": "index_(.+)", "rename_replacement": "restored_index_$1" }
[9:37] its asynchronous process may take some time to restore
grantemarshall [9:38 AM] so a curl command basically (or a URI task in ansible)
naveen [9:38 AM] yes
[9:39] programatically currently it happens only for gmc-sync but not for core intrepid etl, once they are merged it'll handle both cases
[9:39] Refer Restore section in this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
grantemarshall [9:39 AM] can this be tested on dev or stage?
naveen [9:40 AM] yes
grantemarshall [9:58 AM] can you yell me how this was tested on dev? The restore procedure I mean
naveen [10:01 AM] I changed the code and forced it to throw an exception when upating data into ES and it triggered snapshot restore programmatically.
grantemarshall [10:27 AM] on dev I get the following message when trying to restore using the following curl command
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true,"rename_pattern":"index_(.+)", "rename_replacement":"restore_index_$1"}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"},"status":500}
naveen [10:28 AM] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-open-close.html
[10:29] For restore you can just specify index names:
[10:29] >>>curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
grantemarshall [10:34 AM] so this would be
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/index/_close'
restore
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
Then
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/index/_open'
naveen [10:35 AM] yes
grantemarshall [10:43 AM] It's not wanting to play
curl -XPOST 'localhost:9200/elasticsearch-snapshots/_close?pretty'
{ "acknowledged" : true }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"},"status":500}
[10:44] the first curl command was actually : curl -XPOST 'localhost:9200/elasticsearch/_close?pretty' same result
naveen [10:45 AM] dod u close all indices ?
grantemarshall [10:49 AM] I've just tried to close a specific indices :
curl -XPOST 'localhost:9200/elasch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
[10:50] curl -XPOST 'localhost:9200/elasticsearch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
naveen [10:52 AM] seems index is already closed from the error
[10:52] can u wait 15 minutes and then try to do restore job ?
[10:52] check if u get same 403 error on other indices also if u try to close them one by one ?
grantemarshall [10:55 AM] yes I can wait
[10:56] although if indices are all closed then the restore shouldn't fail due to indices being open
naveen [10:58 AM] we've to be sure that all indices are closed
[10:58] wait 10 mins and try to close each index one by one and you should get 403 on all of them
grantemarshall [10:59 AM] I've been through every indices:-
localhost:9200/elasticsearch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/concerns/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/contact-details/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/placements/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/revalidations/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/self-declarations/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/trainee-cards/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"},"status":500}
naveen [11:00 AM] seems self-declarations index has some problem
[11:00] open self-declarations index and open it again
[11:00] then close it and then try to restore each index only by one
[11:01] instead of restoring all of them in one go
[11:01] try to restore them one by one
grantemarshall [11:02 AM]
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps","ignore_unavailable":true, "include_global_state":true}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"},"status":500}
naveen [11:03 AM] Try this:
[11:03] >>>curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps"}'
grantemarshall [11:04 AM]
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps"}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"},"status":500}
[11:04] same result
naveen [11:04 AM] seems on one of the shards the index is open
[11:05] In our code we're doing this programmatically in a cluster mode which is why it was working
[11:05] from command line when we try we are not running it as admin in a cluster
grantemarshall [11:06 AM] understood. is there a way to force it via the cli or is there another requirements for a restore service to be created?
naveen [11:07 AM] hmm not sure but if u check elastic search documentation it might help
[11:09] are the indices in readonly mode in ES ?
[11:09] https://github.com/elastic/elasticsearch/issues/3703 (GitHub issue: "Elasticsearch head requests return 403 when made to readonly index")
[11:10] https://discuss.elastic.co/t/index-closed-exception/24343 weird same as ours
[11:11] see this:
[11:11] https://github.com/elastic/kibana/issues/7947
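Pulling the conversation together: below is a minimal sketch of how the manual restore could be driven from the command line, using the repository (elasticsearch) and snapshot (2016.12.21) names from the transcript. The key difference from the attempts above is that _close and _open are index-level endpoints (e.g. POST /arcps/_close), not paths under /_snapshot. This is a sketch based on the Elasticsearch snapshot/restore documentation, not a verified runbook for our cluster:

```
# Close each live index so the snapshot can overwrite it
# (note: _close is called on the index itself, not under /_snapshot)
for idx in arcps concerns contact-details placements revalidations self-declarations trainee-cards; do
  curl -XPOST "http://localhost:9200/${idx}/_close?pretty"
done

# Start the restore from the named snapshot; this runs asynchronously
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{
  "indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards",
  "ignore_unavailable": true,
  "include_global_state": true
}'

# Restored indices are reopened automatically when recovery completes;
# progress can be followed via the recovery API
curl -XGET 'http://localhost:9200/_recovery?pretty'
```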
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213