Date | |
Authors | |
Status | Documenting |
Summary | The reference service became unstable on one of the production servers (TIS21-2644) |
Impact | TIS running at reduced capacity |
Non-technical Description
TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience reasons.
The “blue” server ran out of disk space, causing the reference service to stop functioning.
Trigger
TIS blue server ran out of disk space.
Detection
Slack monitoring alert.
Logging
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.service.FileTransferService : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185 INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)
com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
    at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
    at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)
Resolution
Removed unneeded stack dump files from /var/log/apps.
Identified files received but without confirmation of success and POSTed a request to the data reader, as a Lambda function would have done yesterday.
Timeline
17:34 - Reference service failure alert on Slack
18:34 - Reference service recovery alert on Slack
18:44 - Reference service failure alert on Slack
These failures continued periodically until 07:34.
09:10 - Reran GMC Sync - for connections.
10:15-11:50 - Identified ESR files which may not have processed and re-ran import.
Root Cause(s)
Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
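The missing cleanup process could take many forms. As a minimal sketch only, the Python below lists files older than a retention period and reports how much space removing them would free. The function names, the retention period, and the default match-everything pattern are hypothetical and are not taken from the TIS codebase; the only detail drawn from this incident is the /var/log/apps directory.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(directory, max_age_days, pattern="*", now=None):
    """Return files under `directory` matching `pattern` that were last
    modified more than `max_age_days` days ago (hypothetical helper)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in Path(directory).rglob(pattern):
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def remove_files(paths, dry_run=True):
    """Return the total size in bytes of `paths`. The files are only
    deleted when dry_run is False, so the default is safe to run."""
    freed = 0
    for path in paths:
        freed += path.stat().st_size
        if not dry_run:
            path.unlink()
    return freed
```

For example, `remove_files(find_stale_files("/var/log/apps", 14))` would report how many bytes a 14-day retention would free without deleting anything. The 14-day figure is an assumption, and a real retention period would need to be agreed by the team.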
Action Items
Lessons Learned
We need better monitoring to warn us before low disk space causes downtime.
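One way such a warning could work is a periodic check that compares disk usage against thresholds. The sketch below uses Python's standard `shutil.disk_usage`. The 80% and 90% thresholds and the function names are assumptions for illustration, and delivering the result to Slack or another alerting channel is not shown.

```python
import shutil


def classify_usage(percent_used, warn_at=80.0, critical_at=90.0):
    """Map a disk usage percentage to an alert level.
    The default thresholds are illustrative assumptions."""
    if percent_used >= critical_at:
        return "CRITICAL"
    if percent_used >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path="/", warn_at=80.0, critical_at=90.0):
    """Return (level, percent_used) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total
    return classify_usage(percent_used, warn_at, critical_at), percent_used


if __name__ == "__main__":
    level, percent = check_disk("/")
    print(f"{level}: / is {percent:.1f}% full")
```

Running a check like this on a schedule for each server would raise a warning while there is still time to free space, rather than after a service has already failed.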