...
Non-technical Description
Prod Blue ran out of storage space during the GMC ETL, then fell over when space was freed up.
...
Trigger
Prod Blue ran out of storage space, so no ETLs could run: there was no disk space left to store data locally.
...
Detection
Adewale Adekoya mentioned in the Teams channel that some Reval users had noticed there was no data in the Reval part of TIS.
...
Resolution
Cleared out some of the large logs on the server, then rebooted the instance.
...
Timeline
- 10:17 - Server logs trimmed and instance rebooted (server failed to restart correctly)
- 10:33 - Forced restart on instance through AWS console
- 10:34 - First reported in Teams
- 10:38 - Server became responsive and TIS started working again, but the Reval overnight workflow had to be rerun
- 10:40 - Restarted GMC-Sync-Prod
- 10:47 - Restarted Intrepid-reval-etl-all-prod
- 11:01 - All ETL services completed; confirmed that data was available.
...
Root Cause(s)
Storage space was consumed by a huge Apache ModSecurity log.
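To spot this kind of oversized log quickly, a small script can list the largest files under a directory. The following is a minimal sketch rather than the tooling used during the incident; the function name `largest_files` and the `/var/log` path are assumptions for illustration, since the report does not say where the log was stored.

```python
import os


def largest_files(root, top_n=5):
    """Return the top_n largest files under root as (size_bytes, path) tuples."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full_path), full_path))
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    # Largest first.
    sizes.sort(reverse=True)
    return sizes[:top_n]


if __name__ == "__main__":
    # "/var/log" is an assumed location, not taken from the incident report.
    for size, path in largest_files("/var/log"):
        print(f"{size / (1024 * 1024):10.1f} MiB  {path}")
```

Run against the server's log directory, this prints the biggest files first, which would make a single runaway log stand out.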
...
Action Items
Action Items | Owner |
---|---|
...
Lessons Learned
Add monitoring and alerting on instance storage so that low disk space is flagged before it stops the ETLs.
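As one possible starting point for storage monitoring, a scheduled check could compare disk usage against a threshold and exit with a non-zero status when the threshold is exceeded. This is a sketch under stated assumptions: the function names, the root path `/`, and the 80% threshold are illustrative choices, not values from the incident or from existing TIS tooling.

```python
import shutil
import sys


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem at path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path="/", threshold_percent=80.0):
    """Return (over_threshold, percent_used) for the filesystem at path."""
    percent = disk_usage_percent(path)
    return percent >= threshold_percent, percent


if __name__ == "__main__":
    # The path and the 80% threshold are assumptions chosen for illustration.
    over, percent = check_disk("/", 80.0)
    print(f"Disk usage on /: {percent:.1f}%")
    if over:
        # A non-zero exit code lets cron or another scheduler raise an alert.
        sys.exit(1)
```

A check like this could run on a schedule, with the scheduler or an existing alerting channel notifying the team when it fails.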