Implementing

Date

Authors

Andy Dingley Joseph (Pepe) Kelly John Simmons (Deactivated)

Status

Done

Summary

Several TIS services, namely the Generic Upload and ESR services, could not authenticate with Keycloak, resulting in outages in those functions.

Impact

Admins were not able to upload any bulk import/update spreadsheets. TODO: ESR impact? Was ESR Bi-directional unable to import/export changes?

...

This was resolved by giving the servers a way of finding the authentication service (and any other service) that works both from inside a single server and across multiple servers. Requests should therefore always be able to find the correct route.

...

  • Generic Upload: reported by a user on Teams

  • ESR: detected by alertmanager but seen by Joseph (Pepe) Kelly when pausing alerts for work that was already being undertaken

...

  • - Blue stage server upgraded. TIS continued to function normally after this upgrade. This upgrade was used to develop a sequence for upgrading the other EC2 instances.

  • 10:30 to 15:13 - Green stage and monitoring server were upgraded. The build server was partially upgraded.

  • 14:xx - Message to James Harris querying whether it was possible to do bulk uploads; there was no indication that they were encountering an issue.

  • 17:15 - Paused production monitoring and began applying upgrades to blue production.

  • 17:20 - Hit similar issues on the production servers, caused by packages remaining from the images migrated to AWS; these were sorted out.

  • 18:20 - Prod inaccessible while being upgraded.

  • 19:04 - Validated that the upgrade was working on the blue server; upgrade of the green server began.

  • 19:42 - Prod appeared fully accessible with upgraded components, monitoring re-enabled.

  • 09:44 - TIS Admin reported, via TIS Support Channel, that they were getting an “Unknown Server Error” when performing bulk uploads.

  • 10:47 - Users were informed on TIS Support Channel that we were aware of a bulk upload issue affecting all users and were investigating.

  • 12:16 - Users were informed on TIS Support Channel that we had deployed a fix for bulk upload.

  • 15:06 - Networking change applied to production and workaround hotfix removed (after validating in stage environment).

  • 17:06 to 18:40 - ESR integration services re-enabled and monitored for processing. One container definition required modified networking.

  • 11:55 - Verified that there were no files pending export and files had been produced

...

  • The services were no longer able to access Keycloak via the public URL, which resolved to a loopback address

  • The hosts file was not configured correctly (not so much a root cause, but definitely something that needed to be corrected once found)

  • The Apache/Docker upgrade caused/required some unexpected configuration changes

  • The change in major OS version (Ubuntu 16.04 to Ubuntu 18.04) appears to have reset the custom DNS settings we were using originally; re-applying these has made the apps run as expected.
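The hosts-file finding above can be checked mechanically. The following is a minimal sketch, not the check the team actually ran: it scans hosts-file text and lists any name other than the usual localhost aliases that is mapped to a loopback address. The function name, the ignore list and the sample hostnames are all assumptions made for illustration.

```python
import ipaddress

# Names that are expected to map to loopback (an assumed, illustrative list).
DEFAULT_IGNORE = ("localhost", "ip6-localhost", "ip6-loopback")


def loopback_pinned_hosts(hosts_text, ignore=DEFAULT_IGNORE):
    """Return hostnames that the given hosts-file text maps to a loopback address."""
    pinned = []
    for raw in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            # Not a parseable IP address; skip the line.
            continue
        if addr.is_loopback:
            pinned.extend(name for name in fields[1:] if name not in ignore)
    return pinned


# Hypothetical hosts file; these names are not taken from the incident.
SAMPLE = """\
127.0.0.1 localhost
127.0.1.1 tis.example.org   # public name pinned to loopback
10.0.0.5  keycloak.internal
::1 ip6-localhost ip6-loopback
"""

print(loopback_pinned_hosts(SAMPLE))
```

A name reported by this scan resolves to the machine itself, so a process inside a container that inherits the mapping would look for the service in its own network namespace rather than on the host.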

...

Action Items

Owner

add playbook/terraform config to fix DNS resolver

John Simmons (Deactivated)

move ESR applications off to ECS asap

John Simmons (Deactivated) Joseph (Pepe) Kelly

Check app records for ESR 7700 (not recorded as exported in TIS) → 3408 (which have been reconciled at some point) → 1,304 (Approved, deduplicated) → → 11 (generated with a “no-change” update)

Joseph (Pepe) Kelly

check ESR and TIS applications are using addresses that can be resolved no matter what platform they are using

John Simmons (Deactivated)
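The last action item could be approached with a small resolution check run from each platform. The sketch below is an assumption about how that might look, not an existing TIS tool: it takes a configured URL, resolves its host, and classifies the result. The `classify_host` name, the labels it returns and the stub resolver with its hostnames are all hypothetical; only `socket.getaddrinfo`, `urllib.parse.urlsplit` and the `ipaddress` module are real standard-library APIs.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def classify_host(url, resolve=socket.getaddrinfo):
    """Classify how the host in `url` resolves: invalid, unresolvable, loopback or ok."""
    host = urlsplit(url).hostname
    if host is None:
        return "invalid"
    try:
        infos = resolve(host, None)
    except socket.gaierror:
        return "unresolvable"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    # Strip any IPv6 scope suffix such as "%eth0" before parsing.
    addrs = {ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos}
    if any(addr.is_loopback for addr in addrs):
        return "loopback"
    return "ok"


def stub_resolver(host, port):
    """Stand-in for socket.getaddrinfo so the example runs without a network."""
    table = {
        "tis.example.org": "127.0.1.1",    # hypothetical name pinned to loopback
        "keycloak.internal": "10.0.0.5",   # hypothetical internal address
    }
    if host not in table:
        raise socket.gaierror("name not known")
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", (table[host], 0))]


for url in (
    "https://tis.example.org/auth",
    "http://keycloak.internal:8080/auth",
    "https://missing.example.org/",
):
    print(url, "->", classify_host(url, resolve=stub_resolver))
```

Called without the `resolve` argument, the function uses the real system resolver, so running it on a host and again inside a container would show whether the two environments agree on where a service address points.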

...