...
Detection
Slack monitoring alert.
Logging
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.service.FileTransferService : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185 INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)
com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
    at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
    at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)
...
Resolution
Removed unneeded stack dump files from /var/log/apps
Identified files that had been received but had no confirmation of success, and POSTed a request to the data reader, as a Lambda function would have done yesterday.
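Finding the files that need a re-run amounts to a set difference between the files that arrived and the files with a recorded success. A minimal sketch in Python, assuming both lists of keys are already to hand. How they are obtained and how the data reader is called are not described in this report, so the function name and inputs below are hypothetical:

```python
from typing import Iterable, List


def find_unconfirmed(received: Iterable[str], confirmed: Iterable[str]) -> List[str]:
    """Return received file keys that have no success confirmation, sorted.

    Hypothetical helper: `received` and `confirmed` are plain lists of keys
    gathered by whatever means is available (bucket listing, logs, database).
    """
    confirmed_set = set(confirmed)
    # Deduplicate and keep only the keys that never got a confirmation.
    return sorted({key for key in received if key not in confirmed_set})
```

Each key returned would then be submitted to the data reader in the same way the Lambda function normally does.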
...
Timeline
17:34 - Reference service failure alert on Slack
18:34 - Reference service recovery alert on Slack
18:44 - Reference service failure alert on Slack
These failures continued periodically until 07:34.
09:10 - Reran GMC Sync for connections.
10:15-11:50 - Identified ESR files which may not have been processed and re-ran the import.
...
Root Cause(s)
Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
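The last two root causes could be addressed with a small scheduled job. The sketch below, in Python, is an illustration under stated assumptions and not the team's tooling: the function names, the usage threshold, and the retention period are all hypothetical and would need to be chosen for the server in question.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_usage_percent(path: str) -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def remove_old_files(directory: str, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Delete regular files under `directory` not modified for `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    # Materialise the listing first so deletions do not disturb the walk.
    for path in list(Path(directory).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue  # skip directories, symlinks, and special files
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed


def check_and_clean(directory: str, threshold_percent: float,
                    max_age_days: float) -> List[Path]:
    """Remove old files only when disk usage has reached the threshold."""
    if disk_usage_percent(directory) < threshold_percent:
        return []
    return remove_old_files(directory, max_age_days)
```

Run from a scheduler, a call such as `check_and_clean("/var/log/apps", 80.0, 14)` would prune files older than 14 days once the volume is 80% full. Both numbers are placeholders, and the same usage figure could also feed an alert before any clean-up is needed.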
...