2023-06-13 Some deployments to ECS failing

Date

Jun 13, 2023

Authors

@Andy Dingley @Reuben Roberts

Status

Done

Summary

The deployment of some components to our preprod environment was failing due to apparent memory constraints

Impact

No new versions of these components could be deployed for a short period of time (~1 day)

Non-technical Description

Trainee Self-Service is made up of a number of distinct software components. These are hosted in the Cloud, in containers (AWS ECS). Containers are similar to lightweight virtual machines, with associated resources such as memory and CPU. When the developers make changes to a component, it is built and redeployed into a container, which must have enough resources to run the component successfully. We found that some components could no longer be deployed into their normal containers. We suspect that underlying changes to the buildpack used to assemble the components meant they required more memory than before, so the containers were no longer large enough to accommodate them, even though we had made no material changes to the component code.


Trigger

  • Deployment of run-of-the-mill component updates (and subsequent reversion to known-good builds) failed


Detection

  • Failure notifications in the #notifications-deployments Slack channel


Resolution

  • Temporary fix: increase container memory for the affected components from 0.5GB to 1GB


Timeline

All times in BST unless indicated

  • Jun 13, 2023: 14:39 - Failed deployment for tis-trainee-ndw-exporter reported in #notifications-deployments Slack channel, closely followed by failed deployments for tis-trainee-credentials and tis-trainee-user-management.

  • Jun 13, 2023: 16:13 - Reverts of the updates to these components also failed to deploy.

  • Jun 14, 2023: 11:25 - Redeploy of tis-trainee-ndw-exporter with ECS task configured with 1GB memory instead of 512MB succeeded

  • Jun 14, 2023: 13:00 - Redeployment of tis-trainee-credentials and tis-trainee-user-management with 1GB memory succeeded

Root Cause(s)

  • Deployments to preprod environment failing for some components (though not for others, e.g. tis-trainee-sync)

  • Logs for the failing components include lines such as:
    6/13/2023, 4:51:44 PM GMT+1 ERROR: failed to launch: exec.d: failed to execute exec.d file at path '/layers/paketo-buildpacks_bellsoft-liberica/helper/exec.d/memory-calculator': exit status 1
    6/13/2023, 4:51:44 PM GMT+1 Calculating JVM memory based on 616300K available memory
    6/13/2023, 4:51:44 PM GMT+1 For more information on this calculation, see https://paketo.io/docs/reference/java-reference/#memory-calculator
    6/13/2023, 4:51:44 PM GMT+1 unable to calculate memory configuration
    6/13/2023, 4:51:44 PM GMT+1 fixed memory regions require 632382K which is greater than 616300K available for allocation: -XX:MaxDirectMemorySize=10M, -XX:MaxMetaspaceSize=120382K, -XX:ReservedCodeCacheSize=240M, -Xss1M * 250 threads

  • Some buildpack versions have changed:
    (ndw exporter)
    Pass:
    paketo-buildpacks/ca-certificates 3.6.2
    paketo-buildpacks/bellsoft-liberica 10.2.3
    paketo-buildpacks/syft 1.30.1
    paketo-buildpacks/executable-jar 6.7.3
    paketo-buildpacks/dist-zip 5.6.2
    paketo-buildpacks/spring-boot 5.25.1
    Fail:
    paketo-buildpacks/ca-certificates 3.6.2
    paketo-buildpacks/bellsoft-liberica 10.2.4
    paketo-buildpacks/syft 1.31.0
    paketo-buildpacks/executable-jar 6.7.3
    paketo-buildpacks/dist-zip 5.6.3
    paketo-buildpacks/spring-boot 5.25.2

  • Further investigation needed to determine if these version changes affected memory requirements or calculation.
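The arithmetic behind the "fixed memory regions require 632382K" log line can be reproduced from the JVM flags it lists. The following is a minimal sketch, assuming that `M` means 1024K (which matches the total in the log); the variable names are illustrative only and are not part of the buildpack.

```python
# Figures copied from the memory-calculator log lines above, converted to K.
# Assumption: 1M = 1024K, which reproduces the total reported in the log.
AVAILABLE_K = 616_300

fixed_regions_k = {
    "-XX:MaxDirectMemorySize=10M": 10 * 1024,
    "-XX:MaxMetaspaceSize=120382K": 120_382,
    "-XX:ReservedCodeCacheSize=240M": 240 * 1024,
    "-Xss1M * 250 threads": 250 * 1 * 1024,
}

# Sum of the fixed (non-heap) regions, and how far it exceeds what is available.
required_k = sum(fixed_regions_k.values())
shortfall_k = required_k - AVAILABLE_K

print(f"fixed regions: {required_k}K")
print(f"available:     {AVAILABLE_K}K")
print(f"shortfall:     {shortfall_k}K")
```

On these figures the fixed regions alone exceed the available memory by roughly 16MB, before any heap is allocated. Thread stacks and the reserved code cache are the two largest contributors.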


Action Items

  • Write-up investigation ticket for managing buildpack versioning
    Owner: @Reuben Roberts
    Ticket: https://hee-tis.atlassian.net/browse/TIS21-4668

  • PRs for increased memory allocation for affected components (at least 3, but note that all small containers might throw errors when we attempt to deploy updates without bumping up the memory allocation)
    Owner: @Reuben Roberts
    Status: DONE


Lessons Learned

  • There may be underlying changes to the way in which components are built, even if the code has not changed.
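One way to make such underlying changes visible is to compare the buildpack versions recorded for a passing and a failing build. The sketch below uses the two ndw exporter lists from Root Cause(s); the dictionary layout and variable names are assumptions for illustration, not an existing tool in our pipeline.

```python
# Buildpack versions copied from the Pass and Fail lists for the ndw exporter.
passing = {
    "paketo-buildpacks/ca-certificates": "3.6.2",
    "paketo-buildpacks/bellsoft-liberica": "10.2.3",
    "paketo-buildpacks/syft": "1.30.1",
    "paketo-buildpacks/executable-jar": "6.7.3",
    "paketo-buildpacks/dist-zip": "5.6.2",
    "paketo-buildpacks/spring-boot": "5.25.1",
}
failing = {
    "paketo-buildpacks/ca-certificates": "3.6.2",
    "paketo-buildpacks/bellsoft-liberica": "10.2.4",
    "paketo-buildpacks/syft": "1.31.0",
    "paketo-buildpacks/executable-jar": "6.7.3",
    "paketo-buildpacks/dist-zip": "5.6.3",
    "paketo-buildpacks/spring-boot": "5.25.2",
}

# Keep only the buildpacks whose version differs between the two builds.
changed = {
    name: (passing[name], failing.get(name))
    for name in passing
    if failing.get(name) != passing[name]
}

for name, (old, new) in sorted(changed.items()):
    print(f"{name}: {old} -> {new}")
```

Four of the six buildpacks differ between the two builds (bellsoft-liberica, syft, dist-zip and spring-boot). This narrows the search but does not by itself show which change, if any, altered the memory requirements.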