Date |
Authors |
Status | Done
Summary | The deployment of some components to our preprod environment was failing due to apparent memory constraints |
Impact | No new versions of these components could be deployed for a short period of time (~1 day) |
Non-technical Description
Trainee Self-Service is made up of a number of distinct software components. These are hosted in the cloud, in containers (AWS ECS). Containers are lightweight, isolated environments, similar to small virtual machines, with associated resources such as memory and CPU. When developers make changes to a component, it is built and redeployed into a container, which must have enough resources to run the component successfully. We found that some components could no longer be deployed into their normal containers. We suspect that underlying changes to the buildpack used to assemble the components meant they required more memory than before, so the containers were no longer large enough to accommodate them, even though we had made no material changes to the component code itself.
Trigger
Deployments of routine component updates failed, as did subsequent attempts to revert to known-good builds.
Detection
Failure notifications in the #notifications-deployments Slack channel
Resolution
Temporary fix: increase container memory for the affected components from 0.5GB to 1GB
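In practice this means registering a new revision of each affected service's ECS task definition with a higher memory limit. How our task definitions are actually managed (Terraform, CI config, or the console) is not covered here; the boto3 snippet below is only an illustrative sketch, and the family, image, CPU value and region are placeholders.

```python
# Illustrative sketch only: register a new task definition revision with
# 1024 MiB of memory instead of 512 MiB. Names and values marked below are
# assumptions, not the actual configuration of these services.
import boto3

ecs = boto3.client("ecs", region_name="eu-west-2")  # region is an assumption

ecs.register_task_definition(
    family="tis-trainee-ndw-exporter",   # placeholder task definition family
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",                           # assumed unchanged CPU allocation
    memory="1024",                       # raised from "512"
    containerDefinitions=[
        {
            "name": "tis-trainee-ndw-exporter",
            "image": "<account>.dkr.ecr.eu-west-2.amazonaws.com/tis-trainee-ndw-exporter:latest",
            "essential": True,
        }
    ],
)
```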
Timeline
All times in BST unless otherwise indicated
: 14:39 - Failed deployment for tis-trainee-ndw-exporter reported in #notifications-deployments Slack channel, closely followed by failed deployments for tis-trainee-credentials and tis-trainee-user-management.
: 16:13 - Attempts to revert these components to their previous known-good builds also failed to deploy.
: 11:25 - Redeployment of tis-trainee-ndw-exporter with its ECS task configured with 1GB memory instead of 512MB succeeded.
: 13:00 - Redeployment of tis-trainee-credentials and tis-trainee-user-management with 1GB memory succeeded.
Root Cause(s)
Deployments to the preprod environment were failing for some components (though not for others, e.g. tis-trainee-sync).
Logs for the failing components include lines such as:
6/13/2023, 4:51:44 PM GMT+1 ERROR: failed to launch: exec.d: failed to execute exec.d file at path '/layers/paketo-buildpacks_bellsoft-liberica/helper/exec.d/memory-calculator': exit status 1
6/13/2023, 4:51:44 PM GMT+1 Calculating JVM memory based on 616300K available memory
6/13/2023, 4:51:44 PM GMT+1 For more information on this calculation, see https://paketo.io/docs/reference/java-reference/#memory-calculator
6/13/2023, 4:51:44 PM GMT+1 unable to calculate memory configuration
6/13/2023, 4:51:44 PM GMT+1 fixed memory regions require 632382K which is greater than 616300K available for allocation: -XX:MaxDirectMemorySize=10M, -XX:MaxMetaspaceSize=120382K, -XX:ReservedCodeCacheSize=240M, -Xss1M * 250 threads
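The 632382K figure in the error is simply the sum of the fixed JVM memory regions listed on that line; the quick check below reproduces it and shows it exceeds the 616300K the memory calculator reported as available, so the calculator aborts before the application even starts.

```python
# Reproduce the memory calculator's fixed-region total from the failing log.
KIB_PER_MIB = 1024

fixed_regions_kib = {
    "-XX:MaxDirectMemorySize=10M": 10 * KIB_PER_MIB,
    "-XX:MaxMetaspaceSize=120382K": 120_382,
    "-XX:ReservedCodeCacheSize=240M": 240 * KIB_PER_MIB,
    "-Xss1M * 250 threads": 250 * KIB_PER_MIB,
}

required_kib = sum(fixed_regions_kib.values())
available_kib = 616_300  # "available memory" reported by the calculator

print(f"required:  {required_kib}K")   # 632382K, matching the error message
print(f"available: {available_kib}K")
print(f"shortfall: {required_kib - available_kib}K")  # 16082K, roughly 16 MiB
```

Raising the container memory gives the calculator enough headroom; the Paketo documentation linked in the log also describes configuration such as BPL_JVM_THREAD_COUNT for reducing the 250-thread stack reservation, but we have not evaluated those options here.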
Some buildpack versions changed between the last passing build and the first failing build (shown for tis-trainee-ndw-exporter):

Buildpack | Passing build | Failing build
---|---|---
paketo-buildpacks/ca-certificates | 3.6.2 | 3.6.2
paketo-buildpacks/bellsoft-liberica | 10.2.3 | 10.2.4
paketo-buildpacks/syft | 1.30.1 | 1.31.0
paketo-buildpacks/executable-jar | 6.7.3 | 6.7.3
paketo-buildpacks/dist-zip | 5.6.2 | 5.6.3
paketo-buildpacks/spring-boot | 5.25.1 | 5.25.2
Further investigation is needed to determine whether these version changes affected the memory requirements or the memory calculation.
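As a starting point for that investigation, a throwaway comparison of the two builds' buildpack lists (versions taken from the table above) narrows the suspects to the four buildpacks that changed:

```python
# Compare buildpack versions between the last passing and first failing builds
# of tis-trainee-ndw-exporter, using the versions recorded above.
passing = {
    "paketo-buildpacks/ca-certificates": "3.6.2",
    "paketo-buildpacks/bellsoft-liberica": "10.2.3",
    "paketo-buildpacks/syft": "1.30.1",
    "paketo-buildpacks/executable-jar": "6.7.3",
    "paketo-buildpacks/dist-zip": "5.6.2",
    "paketo-buildpacks/spring-boot": "5.25.1",
}
failing = {
    "paketo-buildpacks/ca-certificates": "3.6.2",
    "paketo-buildpacks/bellsoft-liberica": "10.2.4",
    "paketo-buildpacks/syft": "1.31.0",
    "paketo-buildpacks/executable-jar": "6.7.3",
    "paketo-buildpacks/dist-zip": "5.6.3",
    "paketo-buildpacks/spring-boot": "5.25.2",
}

for buildpack, old_version in passing.items():
    new_version = failing[buildpack]
    if new_version != old_version:
        print(f"{buildpack}: {old_version} -> {new_version}")
```

The failing exec.d path in the error above sits under paketo-buildpacks_bellsoft-liberica, so its 10.2.3 → 10.2.4 bump is the most obvious candidate, though this remains to be confirmed.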
Action Items
Action Items | Owner | Status
---|---|---
Write up an investigation ticket for managing buildpack versioning | | TO DO
Raise PRs to increase the memory allocation for the affected components (at least 3; note that any component still on a small container may hit the same error the next time we deploy an update without increasing its memory allocation) | | DONE
Lessons Learned
The way a component is built can change underneath us (for example via buildpack updates), so its resource requirements may grow even when our own code has not changed.