...

Non-technical Description

Prod Blue ran out of storage space during the GMC ETL, then fell over when space was freed up.

...

Trigger

  • Prod Blue ran out of storage space, so it could not run any ETLs: there was no space left to store data locally.

...

Detection

  • Adewale Adekoya mentioned in the Teams channel that some Reval users had noticed there was no data in the Reval part of TIS.

...

Resolution

  • Cleaned out some of the large logs on the server, then rebooted it.

...

Timeline

  • - 10:17 - Server logs trimmed and instance rebooted (server failed to restart correctly)

  • - 10:33 - Forced restart on instance through AWS console

  • - 10:34 - First reported in Teams

  • - 10:38 - Server became responsive again and TIS started working, but the Reval overnight workflow had to be rerun

  • - 10:40 - Restarted GMC-Sync-Prod

  • - 10:47 - Restarted Intrepid-reval-etl-all-prod

  • - 11:01 - All ETL services completed; confirmed that data was available

...

Root Cause(s)

  • Storage space was consumed by a huge Apache ModSecurity log.
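
As an illustration of how an oversized log like this might be spotted, here is a minimal Python sketch that lists the largest files under a directory. The function name `largest_files` and the example path are hypothetical and not taken from the incident; the actual log location on Prod Blue is not recorded here.

```python
import heapq
import os


def largest_files(root, top_n=5):
    """Return up to top_n (size_in_bytes, path) tuples, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable (e.g. rotated mid-scan); skip it.
                continue
    return heapq.nlargest(top_n, sizes)
```

For example, `largest_files("/var/log")` (an assumed path) would return the five biggest files below that directory, which is usually enough to see whether a single log is responsible for most of the usage.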

...

Action Items

Action Items

Owner

...

Lessons Learned

  • Add more monitoring of instance storage, so that low disk space is flagged before it runs out.
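
One possible shape for such a check, as a rough sketch only: the Python function below reports when a filesystem passes a usage threshold. The name `storage_alert`, the 80% default and the message format are assumptions for illustration; they do not describe any monitoring that TIS actually has in place.

```python
import shutil


def storage_alert(path="/", threshold_percent=80.0):
    """Return a warning string if the filesystem holding `path` is at or
    above threshold_percent used, otherwise None."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_percent = usage.used / usage.total * 100
    if used_percent >= threshold_percent:
        return f"WARNING: {path} is {used_percent:.1f}% full"
    return None
```

A scheduled job could call `storage_alert("/")` and forward any non-None result to the team's alerting channel. The percentage may differ slightly from what `df` reports, because `df` accounts for blocks reserved for root.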