2021-08-26 TIS down

Date

Aug 26, 2021

Authors

@John Simmons (Deactivated)

Status

Documenting

Summary

Prod Blue was down for 12 minutes; Prod Green stayed up throughout.

Impact

Users could not access TIS for a short period of time

Non-technical Description

Prod Blue ran out of storage space during the GMC ETL, then fell over when space was freed up.


Trigger

  • Prod Blue ran out of storage space, so it could not run any ETLs: there was no space left to store data locally.


Detection

  • @Adewale Adekoya mentioned in the Teams channel that some Reval users had noticed there was no data in the Reval part of TIS.


Resolution

  • Cleaned out some of the large logs on the server, then rebooted the instance.
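As an illustration of how the large logs might be located before cleaning up, the following is a minimal sketch that lists the biggest files under a directory. The function name `largest_files` and the `/var/log` starting path are assumptions made for the example; the report does not say where the logs were stored or how they were found.

```python
import heapq
import os


def largest_files(root, count=10):
    """Return up to `count` (size_bytes, path) pairs for the largest files under `root`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # Unreadable files and broken symlinks are skipped.
                continue
    return heapq.nlargest(count, sizes)


if __name__ == "__main__":
    # "/var/log" is an assumed location, not taken from the incident report.
    for size, path in largest_files("/var/log"):
        print(f"{size / 1024 ** 2:10.1f} MiB  {path}")
```

Running it with sufficient permissions on the affected instance would show which files to trim or rotate first.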


Timeline

  • Aug 26, 2021 - 10:17 - Server logs trimmed and instance rebooted (server failed to restart correctly)

  • Aug 26, 2021 - 10:33 - Forced restart on instance through AWS console

  • Aug 26, 2021 - 10:34 - First reported in Teams

  • Aug 26, 2021 - 10:38 - Server became responsive again and TIS started working, but the Reval overnight workflow had to be rerun

  • Aug 26, 2021 - 10:40 - Restarted GMC-Sync-Prod

  • Aug 26, 2021 - 10:47 - Restarted Intrepid-reval-etl-all-prod

  • Aug 26, 2021 - 11:01 - All ETL services completed and confirmed that data was available.

 


Root Cause(s)

  • Storage space was consumed by a huge Apache ModSecurity log


Action Items

Action Items

Owner



Lessons Learned

  • Add more monitoring of instance storage.
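One possible shape for such a check is sketched below, using only the Python standard library. The function name `storage_alert`, the 80% threshold, and the monitored path `/` are assumptions for illustration; the report does not specify which tooling or thresholds the team would use.

```python
import shutil


def storage_alert(path="/", threshold_percent=80.0):
    """Return (alert, used_percent) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    return used_percent >= threshold_percent, used_percent


if __name__ == "__main__":
    # The path and threshold are illustrative defaults, not values from the incident.
    alert, used = storage_alert("/", threshold_percent=80.0)
    status = "ALERT" if alert else "OK"
    print(f"{status}: {used:.1f}% of storage used on /")
```

A check like this could be run on a schedule and wired to whatever alerting channel the team already uses, so that a growing log is noticed before the disk fills.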