2016-12-21 Intrepid ETL Failure
Date | 2016-12-21 |
Authors | Grante Marshall (Unlicensed), Graham O'Regan (Unlicensed) |
Status | Complete |
Summary | The Intrepid ETL process ran on production but failed due to an error in the SQL query. The Elasticsearch snapshot restore also failed, which left us unable to update the index for the day. We checked with Joanne Watson (Unlicensed), the service manager, to see whether it would impact the pilot; it wasn't being used, so we didn't request access from Hicom to the DR to resolve the issue. |
Impact | The service was not usable for the day. |
Root Cause
A SQL query was referencing the test DR schema. Once we detected that the process had failed, we checked the configuration of the Docker containers and quickly realised that a Docker image had been updated on production, so pre-production code had been released.
Trigger
The nightly Intrepid ETL ran and failed.
Resolution
Pinned the versions of the containers in our configuration.
Detection
After the issues with Hicom's DR run, the team checked the service the following morning. Alex Dobre (Unlicensed) discovered the issue by looking at the container log files.
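The check itself was a look at the container logs on the host. A minimal sketch of that kind of check is below; the container name intrepid-etl is illustrative, not the real one:

```
# List containers that have exited, then scan the ETL container's logs for errors.
# NOTE: "intrepid-etl" is an illustrative container name, not necessarily the real one.
docker ps -a --filter "status=exited"
docker logs --tail 200 intrepid-etl 2>&1 | grep -iE "error|exception"
```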
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Create single config file for container versions | prevent | Graham O'Regan (Unlicensed) | |
Pin versions of containers in stage and prod | prevent | Graham O'Regan (Unlicensed) | |
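As a rough sketch of the two prevention items above (illustrative image names and tags only, not the actual deployment configuration), the idea is a single versions file that both stage and prod deployments read, so a rebuilt image can never drift onto production unnoticed:

```
# versions.env - single source of truth for container tags
# (illustrative image names and tag values)
INTREPID_ETL_TAG=1.4.2
GMC_SYNC_TAG=2.0.1

# deploy snippet (sketch only) - pull and run the pinned tag instead of :latest
. ./versions.env
docker pull hee/intrepid-etl:"${INTREPID_ETL_TAG}"
docker rm -f intrepid-etl 2>/dev/null || true
docker run -d --name intrepid-etl hee/intrepid-etl:"${INTREPID_ETL_TAG}"
```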
Timeline
- The etl-prod job ran at 3am
- The team checked the service at 9am
- The version of the container for the Intrepid ETL was pinned by Graham O'Regan (Unlicensed) at 9pm ahead of the next morning's run.
Supporting Information
Copying conversation between Grante and Naveen:

grantemarshall [9:23 AM] can we carry out a restore based on the snapshot that is in blob storage?
naveen [9:25 AM] Restore will happen only if the etl throws an error while saving or updating data.
[9:25] In our case we had failed even to connect to INTREPID data
[9:26] so restore won't be triggered if etl fails to connect to INTREPID schema
grantemarshall [9:26 AM] So an edge case that's not been catered for yet. Can it be kicked off manually?
naveen [9:27 AM] imagine this case:
[9:27] we go for restore only if the elastic search index is deleted and latest data failed to update
[9:28] that happens only in gmc-sync
[9:28] currently if tis-new-core fails, it won't do the restore
[9:28] when we merge both gmc-sync and core into one container in future, this edge case will go away
[9:28] there is a backlog ticket to merge these 2 into one container
[9:29] so for now, if we rerun the etl on LIVE, it should fetch
grantemarshall [9:29 AM] so the etl fail will kick off the restore? do you want to bring this up in standup?
naveen [9:29 AM] yes
[9:30] For restoring url: POST /_snapshot/my_backup/snapshot_1/_restore
[9:30] you can do that manually
[9:31] my_backup is elasticsearch, snapshot_1 is snapshot name
naveen [9:37 AM] to restore specific indices:
[9:37] >>>POST /_snapshot/my_backup/snapshot_1/_restore { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": true, "rename_pattern": "index_(.+)", "rename_replacement": "restored_index_$1" }
[9:37] its asynchronous process may take some time to restore
grantemarshall [9:38 AM] so a curl command basically (or a URI task in ansible)
naveen [9:38 AM] yes
[9:39] programatically currently it happens only for gmc-sync but not for core intrepid etl, once they are merged it'll handle both cases
[9:39] Refer Restore section in this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
grantemarshall [9:39 AM] can this be tested on dev or stage?
naveen [9:40 AM] yes
grantemarshall [9:58 AM] can you yell me how this was tested on dev? The restore procedure I mean
naveen [10:01 AM] I changed the code and forced it to throw an exception when upating data into ES and it triggered snapshot restore programmatically.
grantemarshall [10:27 AM] on dev I get the following message when trying to restore using the following curl command
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true,"rename_pattern":"index_(.+)", "rename_replacement":"restore_index_$1"}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"},"status":500}
naveen [10:28 AM] https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-open-close.html
[10:29] For restore you can just specify index names:
[10:29] >>>curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
grantemarshall [10:34 AM] so this would be
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/index/_close'
restore
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
Then
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/index/_open'
naveen [10:35 AM] yes
grantemarshall [10:43 AM] It's not wanting to play
curl -XPOST 'localhost:9200/elasticsearch-snapshots/_close?pretty'
{ "acknowledged" : true }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"},"status":500}
[10:44] the first curl command was actually : curl -XPOST 'localhost:9200/elasticsearch/_close?pretty' same result
naveen [10:45 AM] dod u close all indices ?
grantemarshall [10:49 AM] I've just tried to close a specific indices :
curl -XPOST 'localhost:9200/elasch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
[10:50] curl -XPOST 'localhost:9200/elasticsearch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
naveen [10:52 AM] seems index is already closed from the error
[10:52] can u wait 15 minutes and then try to do restore job ?
[10:52] check if u get same 403 error on other indices also if u try to close them one by one ?
grantemarshall [10:55 AM] yes I can wait
[10:56] although if indices are all closed then the restore shouldn't fail due to indices being open
naveen [10:58 AM] we've to be sure that all indices are closed
[10:58] wait 10 mins and try to close each index one by one and you should get 403 on all of them
grantemarshall [10:59 AM] I've been through every indices:-
localhost:9200/elasticsearch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/concerns/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/contact-details/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/placements/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/revalidations/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/self-declarations/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/trainee-cards/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards","ignore_unavailable":true, "include_global_state":true}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [self-declarations] because it's open"},"status":500}
naveen [11:00 AM] seems self-declarations index has some problem
[11:00] open self-declarations index and open it again
[11:00] then close it and then try to restore each index only by one
[11:01] instead of restoring all of them in one go
[11:01] try to restore them one by one
grantemarshall [11:02 AM]
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'localhost:9200/elasticsearch/arcps/_close?pretty'
{ "error" : { "root_cause" : [ { "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" } ], "type" : "index_closed_exception", "reason" : "closed", "index" : "elasticsearch" }, "status" : 403 }
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps","ignore_unavailable":true, "include_global_state":true}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"},"status":500}
naveen [11:03 AM] Try this:
[11:03] >>>curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps"}'
grantemarshall [11:04 AM]
heetis@HEE-TIS-UBUNTU-API-GATEWAY-DEV:~$ curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{"indices": "arcps"}'
{"error":{"root_cause":[{"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"}],"type":"snapshot_restore_exception","reason":"[elasticsearch:2016.12.21] cannot restore index [arcps] because it's open"},"status":500}
[11:04] same result
naveen [11:04 AM] seems on one of the shards the index is open
[11:05] In our code we're doing this programmatically in a cluster mode which is why it was working
[11:05] from command line when we try we are not running it as admin in a cluster
grantemarshall [11:06 AM] understood. is there a way to force it via the cli or is there another requirements for a restore service to be created?
naveen [11:07 AM] hmm not sure but if u check elastic search documentation it might help
[11:09] are the indices in readonly mode in ES ?
[11:09] https://github.com/elastic/elasticsearch/issues/3703 (GitHub issue: "Elasticsearch head requests return 403 when made to readonly index")
[11:10] https://discuss.elastic.co/t/index-closed-exception/24343 weird same as ours
[11:11] see this:
[11:11] https://github.com/elastic/kibana/issues/7947
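Pulling the conversation together: below is a minimal sketch of how the manual restore could be driven from the command line, using the repository (elasticsearch) and snapshot (2016.12.21) names from the transcript. The key difference from the attempts above is that _close and _open are index-level endpoints (e.g. POST /arcps/_close), not paths under /_snapshot. This is a sketch based on the Elasticsearch snapshot/restore documentation, not a verified runbook for our cluster:

```
# Close each live index so the snapshot can overwrite it
# (note: _close is called on the index itself, not under /_snapshot)
for idx in arcps concerns contact-details placements revalidations self-declarations trainee-cards; do
  curl -XPOST "http://localhost:9200/${idx}/_close?pretty"
done

# Start the restore from the named snapshot; this runs asynchronously
curl -XPOST 'http://localhost:9200/_snapshot/elasticsearch/2016.12.21/_restore' -d '{
  "indices": "arcps,concerns,contact-details,placements,revalidations,self-declarations,trainee-cards",
  "ignore_unavailable": true,
  "include_global_state": true
}'

# Restored indices are reopened automatically when recovery completes;
# progress can be followed via the recovery API
curl -XGET 'http://localhost:9200/_recovery?pretty'
```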
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213