...

Detection

  • Slack monitoring alert.

  • Logging

2022-02-06 15:55:53.067  INFO 1 --- [  AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade     : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067  INFO 1 --- [  AsyncThread-1] c.h.t.e.e.service.FileTransferService    : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185  INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable       : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)

com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
        at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
        at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)

...

Resolution

  • Removed unneeded stack dump files from /var/log/apps

  • Identified files that had been received but had no confirmation of successful processing, and POSTed a request to the data reader for them, as a Lambda function would have done the day before.
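
The second step above amounts to a set difference between the files received and the files confirmed as processed. A minimal sketch of that comparison follows, assuming both lists are already available as object keys. The function names and the payload fields (`bucket`, `key`) are hypothetical and are not taken from the data reader's actual API; the sketch deliberately stops short of sending the request.

```python
from typing import Dict, Iterable, List


def find_unconfirmed(received: Iterable[str], confirmed: Iterable[str]) -> List[str]:
    """Return received file keys with no success confirmation, sorted."""
    return sorted(set(received) - set(confirmed))


def build_reader_request(key: str, bucket: str) -> Dict[str, str]:
    """Build a payload for the data reader.

    The field names are assumptions for illustration only; substitute
    whatever the Lambda function actually sends.
    """
    return {"bucket": bucket, "key": key}


if __name__ == "__main__":
    # The first key appears in the logs above; the second is a made-up example.
    received = [
        "in/DE_EMD_RMC_20220206_00002940.DAT",
        "in/EXAMPLE_FILE_A.DAT",
    ]
    confirmed = ["in/EXAMPLE_FILE_A.DAT"]
    for key in find_unconfirmed(received, confirmed):
        print(build_reader_request(key, "esr-sftp-prod"))
```

Keeping the comparison separate from the HTTP call makes it possible to review the list of files before anything is re-submitted.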

...

Timeline

  • 17:34 - Reference service failure alert on Slack

  • 18:34 - Reference service recovery alert on Slack

  • 18:44 - Reference service failure alert on Slack

  • The failures continued periodically until 07:34.

  • 09:10 - Re-ran the GMC Sync for connections.

  • 10:15-11:50 - Identified ESR files that may not have been processed and re-ran the import.

...

Root Cause(s)

  • Blue server ran out of disk space

  • We have no process to clean up unneeded files (old logs, stack dumps, etc.)

  • We have inadequate monitoring on disk/storage usage
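
The last two root causes could be addressed with a small scheduled script. The sketch below is one possible shape for it, not an agreed fix: it reports disk usage as a percentage and lists (or deletes) files older than a given age. The 80% threshold and the 14-day retention period are assumptions chosen for illustration; `/var/log/apps` is the directory named under Resolution.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of disk space in use on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def find_stale_files(directory: str, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Return files under directory last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.stat().st_mtime < cutoff)


def remove_stale_files(directory: str, max_age_days: float,
                       dry_run: bool = True) -> List[Path]:
    """Delete stale files unless dry_run is set; return the files selected."""
    stale = find_stale_files(directory, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale


if __name__ == "__main__":
    THRESHOLD_PERCENT = 80.0  # assumed alert threshold
    RETENTION_DAYS = 14       # assumed retention period

    used = disk_used_percent("/")
    if used >= THRESHOLD_PERCENT:
        print(f"WARNING: disk usage at {used:.1f}%")

    # Dry run by default: print what would be removed without deleting anything.
    for path in remove_stale_files("/var/log/apps", RETENTION_DAYS, dry_run=True):
        print(f"would remove {path}")
```

Defaulting to a dry run means the first scheduled executions only report what would be deleted, so the retention period can be checked before any files are removed.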

...