2022-02-04 Reference service unstable on one production server
Date | Feb 4, 2022 |
Authors | @Reuben Roberts |
Status | Done |
Summary | The reference service became unstable on one of the production servers: https://hee-tis.atlassian.net/browse/TIS21-2644 |
Impact | TIS running at reduced capacity |
Non-technical Description
TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience.
The “blue” server ran out of disk space, causing the reference service to stop functioning.
Trigger
TIS blue server ran out of disk space.
Detection
Slack monitoring alert.
Logging
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.service.FileTransferService : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185 INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)
com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)
Resolution
Removed unneeded stack dump files from /var/log/apps
Identified files received but without confirmation of success and POSTed a request to the data reader, as a Lambda function would have done yesterday.
Timeline
Feb 4, 2022 17:34 - Reference service failure alert on Slack
Feb 4, 2022 18:34 - Reference service recovery alert on Slack
Feb 4, 2022 18:44 - Reference service failure alert on Slack
These failures continued periodically until Feb 7, 2022 07:34.
Feb 7, 2022 09:10 - Reran the GMC Sync for connections.
Feb 7, 2022 10:15-11:50 - Identified ESR files which may not have processed and re-ran import.
Root Cause(s)
Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
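As an illustration of the missing clean-up process, the sketch below lists (and optionally deletes) files older than a given age under a directory such as /var/log/apps. It is a minimal example, not the process TIS uses: the function names, the 30-day retention period and the dry-run default are all assumptions made for illustration.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days, now=None):
    """Return files under `root` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = [
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    ]
    return sorted(stale)


def remove_stale_files(root, max_age_days, dry_run=True, now=None):
    """Delete stale files; with dry_run=True only report what would be deleted."""
    candidates = find_stale_files(root, max_age_days, now=now)
    if not dry_run:
        for path in candidates:
            path.unlink()
    return candidates


if __name__ == "__main__":
    # Hypothetical usage: the 30-day retention period is an assumed value.
    for path in remove_stale_files("/var/log/apps", max_age_days=30, dry_run=True):
        print(f"would remove {path}")
```

Defaulting to a dry run means a scheduled job can log what it would delete before anyone trusts it to delete for real.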
Action Items
Action Items | Owner |
---|---|
Consider prioritising https://hee-tis.atlassian.net/browse/TIS21-1383 | |
Lessons Learned
We need better monitoring to warn us before disk space limitations cause downtime.
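One possible shape for such a warning is a periodic check that compares disk usage against a threshold, sketched below with Python's standard `shutil.disk_usage`. The 80% threshold, the function names and the message format are assumptions for illustration. Delivering the message to Slack is left out, since the existing alerting integration is not described here.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def check_disk(path="/", warn_at=80.0, usage_percent=None):
    """Return a warning string if usage is at or above `warn_at`, else None.

    `usage_percent` can be supplied directly, which makes the threshold
    logic testable without depending on the real disk.
    """
    pct = disk_usage_percent(path) if usage_percent is None else usage_percent
    if pct >= warn_at:
        return f"WARNING: {path} is {pct:.1f}% full (threshold {warn_at:.0f}%)"
    return None


if __name__ == "__main__":
    # Hypothetical usage: the 80% threshold is an assumed value.
    message = check_disk("/", warn_at=80.0)
    if message:
        print(message)
```

Setting the threshold well below 100% leaves time to clear space before services start failing with "No space left on device".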
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213