Date

Authors

Rob Pink, Joseph (Pepe) Kelly

Status

Summary

Bulk upload was failing with 401 errors. The upload service had been deployed with out-of-date configuration values, which made it unusable.

Impact

Bulk upload was unavailable for c. 3-4 working hours.

Jira: TIS21-6637

Non-technical Description

A user attempting to do a bulk upload repeatedly saw a message that the “server took too long”, after which the page refreshed. On investigation, it was found that an out-of-date piece of configuration had been released the afternoon before, and this impacted the bulk upload process. The latest configuration was made available, which restored service.

...

Trigger

Deploying / Approving a deployment

...

Detection

A user alerted the team via Teams.

...

Resolution

  • Synchronised the infrastructure definition used by the build process with the IaC repository, then reran the CI/CD pipeline.

...

Timeline

All times BST unless otherwise indicated.

  • “Infrastructure definitions” left out of date.

  • ~12:02-13:30 The configuration used for deploying was manually edited and the pipeline executed. It was then released to production.

  • 14:44 A user reported the problem.

  • 09:10 - 09:26 The infrastructure-as-code definitions were updated where they are used by the build process, the pipeline was run, and users were notified.

5 Whys (or other analysis of Root Cause)

The page was refreshing because API calls returned 401 errors.

...

Actions for an earlier Live Defect had not been completed and this meant that builds were using an earlier copy of our infrastructure definition.
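
One way to guard against builds using a stale copy of the infrastructure definition is to compare the revision held by the build process with the current revision in the IaC repository before deploying. The sketch below is illustrative only: `check_definition_drift`, `DriftResult`, and the idea of passing in revision identifiers are assumptions made for this example, not a description of the team's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftResult:
    """Outcome of comparing the build's copy with the IaC repository."""
    in_sync: bool
    message: str


def check_definition_drift(build_revision: str, repo_revision: str) -> DriftResult:
    """Compare the revision used by the build with the repository's revision.

    Both arguments are hypothetical inputs, e.g. commit identifiers that a
    pipeline step has already looked up. A missing value is treated as drift
    so that the deployment stops rather than proceeding on a guess.
    """
    build = build_revision.strip().lower()
    repo = repo_revision.strip().lower()
    if not build or not repo:
        return DriftResult(False, "revision missing; refusing to deploy")
    if build != repo:
        return DriftResult(
            False,
            f"build copy is at {build[:7]} but repository is at {repo[:7]}",
        )
    return DriftResult(True, "infrastructure definition is up to date")
```

A pipeline step could call this before the approval stage and fail the run when `in_sync` is false, so that a stale definition is caught before release rather than by a user.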

...

Action Items

  • “Unresolve” Build Server card until it has been fully resolved

  • Run bulk upload on a Cloud Native service. Owner: Joseph (Pepe) Kelly

  • Repair persistent logging for bulk upload

  • Investigate modifying response codes when services are unavailable
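
For the response-code action, one option that could be investigated is returning 503 (Service Unavailable) when a backing service cannot be used, and reserving 401 (Unauthorized) for genuine authentication failures, so that the front end does not treat an outage as an expired session. The sketch below is a minimal illustration under that assumption; `choose_status` and `should_restart_login` are hypothetical names, not part of the existing services.

```python
from http import HTTPStatus


def choose_status(service_available: bool, authenticated: bool) -> HTTPStatus:
    """Pick a response status, checking availability before authentication.

    Hypothetical helper: an unavailable service is reported as 503 even if
    the caller's credentials could not be verified, because the failure is
    on the server side.
    """
    if not service_available:
        return HTTPStatus.SERVICE_UNAVAILABLE
    if not authenticated:
        return HTTPStatus.UNAUTHORIZED
    return HTTPStatus.OK


def should_restart_login(status: HTTPStatus) -> bool:
    """Client-side rule: only a 401 should trigger a re-login or refresh."""
    return status == HTTPStatus.UNAUTHORIZED
```

With this split, a client receiving 503 could show an "unavailable, try again later" message instead of refreshing the page.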

...

...

Lessons Learned