Date	04 Feb 2022
Authors	Reuben Roberts
Status	Documenting
Summary	The reference service became unstable on one of the production servers (TIS21-2644)
Impact	TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience reasons.

The “blue” server ran out of disk space, causing the reference service to stop functioning.

Trigger

TIS blue server ran out of disk space.

Detection

Slack monitoring alert.

Logging

2022-02-06 15:55:53.067  INFO 1 --- [  AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade     : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067  INFO 1 --- [  AsyncThread-1] c.h.t.e.e.service.FileTransferService    : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185  INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable       : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)

com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
        at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
        at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)

Resolution

Removed unneeded stack dump files from /var/log/apps

Identified files that had been received but had no confirmation of successful processing, and POSTed a request to the data reader for them, as a Lambda function would have done the day before.

Timeline

04 Feb 2022 17:34 - Reference service failure alert on Slack
04 Feb 2022 18:34 - Reference service recovery alert on Slack
04 Feb 2022 18:44 - Reference service failure alert on Slack
These failures continued periodically until 07 Feb 2022 07:34.
07 Feb 2022 09:10 - Reran the GMC sync for connections.
07 Feb 2022 10:15-11:50 - Identified ESR files which may not have been processed and reran the import.

Root Cause(s)

Blue server ran out of disk space
We have no process to clean unneeded files (old logs, stack dumps, etc.)
We have inadequate monitoring on disk/storage usage
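
A scheduled clean-up job is one way to address the second root cause. The following is a minimal sketch, not the team's actual process: the function names, the 14-day retention period and the file patterns are assumptions for illustration (the real stack dump file names are not recorded in this report). Only the directory /var/log/apps comes from the Resolution section.

```python
import time
from pathlib import Path

# Hypothetical patterns; adjust to the real log and stack dump file names.
DEFAULT_PATTERNS = ("*.log", "*.dump")


def find_stale_files(directory, max_age_days, patterns=DEFAULT_PATTERNS):
    """Return files under `directory` matching `patterns` that were last
    modified more than `max_age_days` days ago."""
    cutoff = time.time() - max_age_days * 86400
    stale = set()  # a set avoids duplicates if a file matches two patterns
    for pattern in patterns:
        for path in Path(directory).rglob(pattern):
            if path.is_file() and path.stat().st_mtime < cutoff:
                stale.add(path)
    return sorted(stale)


def remove_stale_files(directory, max_age_days, patterns=DEFAULT_PATTERNS,
                       dry_run=True):
    """Delete stale files and return the number of bytes they occupy.

    With dry_run=True (the default) nothing is deleted; the return value
    shows how much space a real run would free.
    """
    freed = 0
    for path in find_stale_files(directory, max_age_days, patterns):
        freed += path.stat().st_size
        if not dry_run:
            path.unlink()
    return freed
```

Calling `remove_stale_files("/var/log/apps", 14)` reports how many bytes would be freed without deleting anything; passing `dry_run=False` performs the deletion. Defaulting to a dry run is a deliberate choice so that a misconfigured pattern cannot silently remove files that are still needed.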

Action Items

Action Items	Owner
Consider prioritising TIS21-1383

Lessons Learned

We need better monitoring on disk usage so that we are warned before disk space limitations cause downtime.
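
As an illustration of what such a check could look like, here is a minimal sketch using Python's standard `shutil.disk_usage`. The function names and the 80% and 90% thresholds are assumptions, not agreed values, and delivery of the alert (for example to Slack) is deliberately left out because the existing alerting setup is not described in this report.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100


def check_disk(path="/", warn_at=80.0, critical_at=90.0, usage_percent=None):
    """Classify disk usage as OK, WARNING or CRITICAL.

    `usage_percent` can be supplied directly (useful for testing);
    otherwise it is measured from `path`. Thresholds are hypothetical.
    """
    pct = disk_usage_percent(path) if usage_percent is None else usage_percent
    if pct >= critical_at:
        level = "CRITICAL"
    elif pct >= warn_at:
        level = "WARNING"
    else:
        level = "OK"
    return level, f"{level}: {path} is {pct:.1f}% full"
```

Run on a schedule, the result of `check_disk("/")` could be forwarded to the existing alert channel whenever the level is not `OK`, giving warning at the lower threshold well before writes start failing with "No space left on device".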

2022-02-04 Reference service unstable on one production server