Date

Authors

Reuben Roberts

Status

Done

Summary

The reference service became unstable on one of the production servers.

Jira: TIS21-2644

Impact

Users should generally have been unaffected since the reference service continued to run on the other production server.

TIS running at reduced capacity

Non-technical Description

TIS is split across two servers, blue and green. Requests are balanced across these two servers for performance and resilience.

The “blue” server ran out of disk space, causing the reference service to stop functioning.

...

Trigger

  • TIS blue server ran out of disk space.

Detection

  • Slack monitoring alert.

  • Logging

Code Block
2022-02-06 15:55:53.067  INFO 1 --- [  AsyncThread-1] c.h.t.e.e.facade.FileProcessorFacade     : Processing [in/DE_EMD_RMC_20220206_00002940.DAT]
2022-02-06 15:55:53.067  INFO 1 --- [  AsyncThread-1] c.h.t.e.e.service.FileTransferService    : Downloading [in/DE_EMD_RMC_20220206_00002940.DAT] from S3 bucket [esr-sftp-prod]
2022-02-06 15:55:53.185  INFO 1 --- [anager-worker-5] c.a.s.s3.transfer.DownloadCallable       : Retry the download of object in/DE_EMD_RMC_20220206_00002940.DAT (bucket esr-sftp-prod)

com.amazonaws.SdkClientException: Unable to store object contents to disk: No space left on device
        at com.amazonaws.services.s3.internal.ServiceUtils.downloadToFile(ServiceUtils.java:314)
        at com.amazonaws.services.s3.transfer.DownloadCallable.retryableDownloadS3ObjectToFile(DownloadCallable.java:282)

...

Resolution

  • Removed unneeded stack dump files from /var/log/apps

  • Identified files that had been received without confirmation of successful processing and POSTed a request to the data reader for them, as a Lambda function would have done the previous day.
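
The manual cleanup in the first step could be turned into a repeatable task. The following is a minimal sketch only, not the procedure the team used: the function names, the `*.dump` pattern and the 14-day threshold are illustrative assumptions, and only the `/var/log/apps` directory comes from this report. It defaults to a dry run so that nothing is deleted until the candidate list has been reviewed.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(directory, pattern="*", max_age_days=14, now=None):
    """Return files under `directory` matching `pattern` whose last
    modification is older than `max_age_days` (hypothetical helper)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in Path(directory).rglob(pattern):
        # Only regular files are candidates; directories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)


def remove_files(paths, dry_run=True):
    """Return the total size in bytes of `paths`.

    With dry_run=True (the default) nothing is deleted, so the result is
    the space that would be reclaimed. With dry_run=False the files are
    deleted as well.
    """
    freed = 0
    for path in paths:
        freed += path.stat().st_size
        if not dry_run:
            path.unlink()
    return freed
```

For example, `remove_files(find_stale_files("/var/log/apps", pattern="*.dump"))` would report how many bytes could be reclaimed without touching any file. The real stack dump file names would need to be checked before choosing a pattern.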

...

Timeline

  • 17:34 - Reference service failure alert on Slack

  • 18:34 - Reference service recovery alert on Slack

  • 18:44 - Reference service failure alert on Slack

  • These failures continued periodically until 07:34 the next morning.

  • 09:10 - Re-ran the GMC sync for connections.

  • 10:15-11:50 - Identified ESR files that may not have been processed and re-ran the import.

...

Root Cause(s)

  • Blue server ran out of disk space

  • We have no process to clean unneeded files (old logs, stack dumps, etc.)

  • We have inadequate monitoring on disk/storage usage
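
The monitoring gap in the last root cause could be narrowed with a periodic disk usage check. The sketch below is one possible shape under stated assumptions, not an existing TIS component: the function names and the 80% / 90% thresholds are hypothetical, and forwarding the result to the existing Slack alerting is left out because its interface is not described here.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of usable space consumed on the filesystem
    that contains `path`."""
    usage = shutil.disk_usage(path)
    # Divide by used + free rather than total, so that space the
    # filesystem holds back from ordinary users is not counted as free.
    return 100.0 * usage.used / (usage.used + usage.free)


def classify(percent, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to an alert level (thresholds are assumptions)."""
    if percent >= critical_at:
        return "CRITICAL"
    if percent >= warn_at:
        return "WARNING"
    return "OK"


def check_disk(path="/", warn_at=80.0, critical_at=90.0):
    """Return (level, percent) for the filesystem containing `path`."""
    percent = disk_usage_percent(path)
    return classify(percent, warn_at, critical_at), percent
```

Run on a schedule against each server, a check like this would raise a warning while there is still time to clear space, rather than only after a service has stopped.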

...

Action Items

Action Items

Owner

...

Consider prioritising Jira ticket TIS21-1383.

...

Lessons Learned

  • We need better monitoring that warns us before disk space limitations cause downtime.