2016-12-19 Elasticsearch snapshots deleted on production
Date | 2016-12-19 |
Authors | Grante Marshall, Graham O'Regan |
Status | Complete |
Summary | Elasticsearch snapshots being deleted in Production |
Impact | None; the service was not affected |
Root Cause
The Curator configuration on production was set to retain only a single snapshot. When the snapshot process failed, Curator removed the only remaining snapshot from Azure Blob Storage.
Trigger
The nightly snapshot jobs run from Jenkins failed because snapshots already existed. Curator then ran and deleted the only remaining snapshot.
Resolution
The number of retained snapshots was increased to 5, one per day.
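The effect of the retention setting can be illustrated with a minimal sketch. It assumes the cleanup is age-based (snapshots older than the retention window are deleted), which this report does not confirm. The function `snapshots_to_delete`, the snapshot name, and the timestamps are hypothetical and are not taken from the actual Curator configuration.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, now, retain_days):
    """Return the names of snapshots older than `retain_days`.

    Assumption: retention is age-based. This is an illustration only,
    not Curator's actual implementation.
    """
    cutoff = now - timedelta(days=retain_days)
    return [name for name, taken_at in snapshots.items() if taken_at < cutoff]


# Hypothetical state: the previous night's snapshot succeeded, tonight's
# job failed, so no new snapshot exists when the cleanup runs.
snapshots = {"snapshot-2016-12-18": datetime(2016, 12, 18, 23, 0)}
cleanup_time = datetime(2016, 12, 19, 23, 30)

# Retaining 1 day: the only snapshot is past the cutoff and is deleted.
print(snapshots_to_delete(snapshots, cleanup_time, retain_days=1))

# Retaining 5 days: the same snapshot survives the failed run.
print(snapshots_to_delete(snapshots, cleanup_time, retain_days=5))
```

Under these assumptions, a single failed nightly run with a one-day window leaves nothing behind, while a five-day window tolerates several consecutive failures before the last good snapshot is removed.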
Detection
The Jenkins jobs that run the Elasticsearch snapshots failed, and the failures were reported to the #dev channel in Slack.
Action Items
Action Item | Type | Owner | Issue |
---|---|---|---|
Increase the number of snapshots retained | prevent | Grante Marshall | |
Timeline
- 11pm Jenkins job ran
- Jenkins sent notification to Slack
- N changed the Curator settings to retain 5 days of snapshots.
Supporting Information
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213