2021-08-26 TIS down
Date | Aug 26, 2021 |
Authors | @John Simmons (Deactivated) |
Status | Documenting |
Summary | Prod blue fell over (Prod green was up) for 12 mins |
Impact | Users could not access TIS for a short period of time |
Non-technical Description
Prod blue ran out of storage space during the GMC ETL, then fell over when space was freed up.
Trigger
Prod blue ran out of storage space, so it could not run any ETLs: there was no local disk space left to store data.
Detection
@Adewale Adekoya mentioned in the Teams channel that some Reval users had noticed there was no data in the Reval part of TIS.
Resolution
Cleaned out some of the large logs on the server, then rebooted.
Timeline
Aug 26, 2021 - 10:17 - Server logs trimmed and instance rebooted (server failed to restart correctly)
Aug 26, 2021 - 10:33 - Forced restart on instance through AWS console
Aug 26, 2021 - 10:34 - First reported in Teams
Aug 26, 2021 - 10:38 - Server became responsive again and TIS started working, but the Reval overnight workflow had to be rerun
Aug 26, 2021 - 10:40 - Restarted GMC-Sync-Prod
Aug 26, 2021 - 10:47 - Restarted Intrepid-reval-etl-all-prod
Aug 26, 2021 - 11:01 - All ETL services completed and data was confirmed to be available.
Root Cause(s)
Storage space was consumed by a huge Apache ModSecurity log.
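To illustrate how an oversized log like this could be spotted before it fills the disk, here is a minimal sketch that lists the largest files under a directory. It is not the tooling used during this incident; the function name `largest_files` and the `/var/log` path in the usage note are assumptions for illustration only.

```python
import os


def largest_files(root, top_n=5):
    """Return up to `top_n` (size_in_bytes, path) tuples under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # File vanished or is unreadable (e.g. rotated mid-scan); skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:top_n]
```

For example, `largest_files("/var/log", top_n=10)` would return the ten largest files under a hypothetical log directory, which could then be reviewed for trimming or rotation.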
Action Items
Action Items | Owner |
---|---|
|
Lessons Learned
Add more monitoring to instance storage.
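As a rough sketch of what such monitoring could look like, the snippet below checks how full a filesystem is and reports whether it has crossed a threshold. The function names and the 80% default threshold are assumptions, not an agreed team standard; in practice this would more likely be handled by an existing monitoring agent with alerting.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_disk_nearly_full(path="/", threshold_percent=80.0):
    """Return True when usage at `path` is at or above `threshold_percent` (assumed default)."""
    return disk_usage_percent(path) >= threshold_percent
```

A scheduled job could call `is_disk_nearly_full("/")` and raise an alert when it returns `True`, giving time to trim logs before ETLs start failing.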
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213