2022-02-04 Reference service unstable on one production server

Date

Feb 4, 2022

Authors

@Reuben Roberts

Status

Done

Summary

The reference service became unstable on one of the production servers. Ticket: https://hee-tis.atlassian.net/browse/TIS21-2644

Impact

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience.

The “blue” server ran out of disk space, causing the reference service to stop functioning.


Trigger

  • TIS blue server ran out of disk space.

Detection

  • Slack monitoring alert.

  • Logging

2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067 INFO 1 --- [ AsyncThread-1] c.h.t.e.e.service.FileTransferService : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185 INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)
com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
    at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
    at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)

Resolution

  • Removed unneeded stack dump files from /var/log/apps

  • Identified files received but without confirmation of success and POSTed a request to the data reader, as a Lambda function would have done yesterday.


Timeline

  • Feb 4, 2022 17:34 - Reference service failure alert on Slack

  • Feb 4, 2022 18:34 - Reference service recovery alert on Slack

  • Feb 4, 2022 18:44 - Reference service failure alert on Slack

  • These failures continued periodically until Feb 7, 2022 07:34.

  • Feb 7, 2022 09:10 - Reran GMC Sync - for connections.

  • Feb 7, 2022 10:15-11:50 - Identified ESR files which may not have processed and re-ran import.


Root Cause(s)

  • Blue server ran out of disk space

  • We have no process to clean unneeded files (old logs, stack dumps, etc.)

  • We have inadequate monitoring on disk/storage usage
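
The missing clean-up process could look something like the following. This is a minimal sketch, not the team's actual tooling: the function names, the 30-day retention period and the dry-run default are all assumptions chosen for illustration. The directory /var/log/apps is the one mentioned under Resolution.

```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_days, now=None):
    """Return files directly under `directory` last modified more than
    `max_age_days` days ago. `now` can be injected for testing."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for path in Path(directory).iterdir():
        # Only regular files are considered; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def delete_files(paths, dry_run=True):
    """Delete the given files and return the list that was (or would be)
    removed. With dry_run=True nothing is deleted."""
    removed = []
    for path in paths:
        if not dry_run:
            path.unlink()
        removed.append(path)
    return removed


if __name__ == "__main__":
    # Hypothetical usage: report files older than 30 days (assumed retention).
    for p in delete_files(find_stale_files("/var/log/apps", 30), dry_run=True):
        print(f"would remove {p}")
```

Running in dry-run mode first makes it possible to review what would be removed before anything is deleted on a production server.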


Action Items

Action Items

Owner

Consider prioritising https://hee-tis.atlassian.net/browse/TIS21-1383


Lessons Learned

  • We need better monitoring to warn us before low disk space causes downtime.
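
One way to get such a warning is a periodic check of disk usage against a threshold. The sketch below uses Python's standard shutil.disk_usage. The 80% threshold and the function names are assumptions, and delivering the message (for example to the existing Slack alerts channel) is deliberately left out.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/", warn_at=80.0, usage_percent=None):
    """Return a warning string if usage is at or above `warn_at` percent,
    otherwise None. `usage_percent` can be supplied directly for testing."""
    pct = disk_usage_percent(path) if usage_percent is None else usage_percent
    if pct >= warn_at:
        return f"WARNING: {path} is {pct:.1f}% full (threshold {warn_at:.0f}%)"
    return None


if __name__ == "__main__":
    # Hypothetical usage: run from a scheduler and forward any message
    # to the team's alerting channel.
    message = check_disk("/", warn_at=80.0)
    if message:
        print(message)
```

A threshold well below 100% leaves time to clear space before services such as the reference service start failing.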