Date | |
Authors | |
Status | DocumentingDone |
Summary | The reference service became unstable on one of the production servers. |
Impact | TIS running at reduced capacity. |
Non-technical Description
TIS is split across two servers, blue and green. Requests are balanced across the two for performance and resilience.
The “blue” server ran out of disk space, causing the reference service to stop functioning.
...
Trigger
TIS blue server ran out of disk space.
Detection
Slack monitoring alert.
Logging
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.service.FileTransferService : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185 INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)
com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
    at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
    at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)
...
Resolution
Removed unneeded stack dump files from /var/log/apps
Identified files that had been received but had no confirmation of success, and POSTed a request to the data reader for each, as a Lambda function would have done yesterday.
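Finding the files that were received but never confirmed amounts to a set difference. The following is a minimal sketch, assuming both lists are available as plain file names. The function name and the `EXAMPLE_*` file names are hypothetical, and the request to the data reader is left out because its endpoint is not described in this report.

```python
def find_unconfirmed(received, confirmed):
    """Return received file names with no success confirmation, in received order."""
    confirmed_set = set(confirmed)
    return [name for name in received if name not in confirmed_set]


# Hypothetical inputs: in practice these would come from the S3 bucket
# listing and from whatever records a successful import.
received = [
    "in/DE_EMD_RMC_20220206_00002940.DAT",
    "in/EXAMPLE_A.DAT",
    "in/EXAMPLE_B.DAT",
]
confirmed = ["in/EXAMPLE_A.DAT"]

# Files that would need a manual request to the data reader.
to_reprocess = find_unconfirmed(received, confirmed)
```

Keeping the result in received order makes it easier to replay the files in the sequence they arrived.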
...
Timeline
17:34 - Reference service failure alert on Slack
18:34 - Reference service recovery alert on Slack
18:44 - Reference service failure alert on Slack
These failures continued periodically until 07:34.
09:10 - Reran GMC Sync - for connections.
10:15-11:50 - Identified ESR files which may not have processed and re-ran import.
...
Root Cause(s)
Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
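A clean-up process for unneeded files could be as simple as a scheduled script that removes files older than a retention period. The sketch below is one possible shape, not an agreed design: the function names, the 14-day retention in the usage note, and the choice to look only at files directly inside the directory are all assumptions.

```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_days, now=None):
    """Return files directly under `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def remove_files(paths, dry_run=True):
    """Delete the given files and return them.

    With dry_run=True nothing is deleted; the return value lists what
    would have been removed.
    """
    selected = []
    for path in paths:
        if not dry_run:
            path.unlink()
        selected.append(path)
    return selected
```

For example, `remove_files(find_stale_files("/var/log/apps", 14))` would list candidates without deleting anything. Defaulting to a dry run is deliberate, so the file list can be reviewed before the script is allowed to delete.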
...
Action Items
Action Items | Owner |
---|---|
Consider prioritising | |
...
Lessons Learned
We need better monitoring to pre-emptively warn us before disk space limitations cause downtime.
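As an illustration of what such a pre-emptive check might look like, the sketch below reads filesystem usage with Python's standard `shutil.disk_usage` and maps it to a severity level. The 80% and 90% thresholds and the function names are assumptions chosen for the example. How the result reaches Slack is not shown.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent_used, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to 'ok', 'warning' or 'critical'."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    percent = disk_usage_percent("/")
    print(f"{classify_usage(percent)}: {percent:.1f}% of / in use")
```

Having a warning level below the critical one gives time to clear space before a download fails with "No space left on device".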