Date

Authors

Joseph (Pepe) Kelly, Marcello Fabbri (Unlicensed)

Status

Live defect done. Investigating mitigations for the future.

Summary

Some exported placements show an unknown ESR status (?) on TIS instead of correctly displaying their exported status (✔)

Impact

Inaccurate information regarding some placements’ ESR status

...

Non-technical Description

...

The TIS-ESR interface exports data to ESR daily. All the Applicants were sent to ESR as expected. However, the interface failed to communicate the completed export of some Placements to TIS, because TIS was momentarily unavailable.

As a result, these Placements’ status on TIS remained unclear. They appeared in lists of placements with a question mark on the frontend (?) instead of displaying the correct exported status (marked with a tick ✔).

The process has been updated so that when updates fail because of connection problems, the communication to TIS is retried a reasonable number of times.

...

Trigger

  • TCS was momentarily unavailable (updates sent via REST calls were not processed)

...

Detection

  • A large number of messages were dumped in the Dead Letter Queue, which triggered a monitor alert

...

Resolution

  • When Placements are exported to ESR, the Data Export service sends a message via a RabbitMQ queue to the Inbound Data Writer service.

...

Timeline

  • 15:33 BST - Slack notification regarding a high volume of messages in the Dead Letter Queue

  • 15:58 BST - Poor interaction between the InboundDataWriterService and TCS identified as the culprit for messages being discarded when info hadn’t successfully been recorded into TIS

  • Fix put in place to increase resilience of the InboundDataWriterService when interacting with TCS

  • Affected data amended in order to display accurate ESR export status

Root Cause(s)

  • When Placements are exported to ESR, the Data Export service sends a message via a RabbitMQ queue to the Inbound Data Writer service. In this case, TCS didn’t have the data to show the tick to say the placement was updated.

  • The Inbound Data Writer service normally sends the updates to TCS via REST call, and TCS is responsible for updating the PlacementEsrEvent table where this data is stored. This time the REST call failed.

  • The message was not requeued (therefore re-processing was not attempted), and the updates were not applied.

  • TCS was momentarily unavailable* right when the Inbound Data Writer service sent the REST call, so it didn’t handle that call and didn’t update anything.

  • The Inbound Data Writer service treated TCS’s unavailability like a problem with the message. It received a specific error in response and had a clause in place aimed at not requeuing the message in such a case, because messages with such problems aren’t requeued.

  • *Why was TCS unavailable? There could be a number of reasons, different every day. For example, on April 6, 2021 there was a live defect between 2 and 2:30 pm which affected the stage and prod environments, and some RuntimeExceptions were thrown on Stage and affected the ESR-TIS interaction, e.g.:
    "x-exception-type": "java.lang.RuntimeException",
    "x-exception-message": "Throwing exception so that this message is not requeued",
    "eventSourceTimestamp": "2021-04-06T13:10:00.178Z",

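The distinction behind this root cause, a message that is itself faulty versus a downstream service that is temporarily unavailable, can be sketched as below. This is a minimal illustration and not the actual InboundDataWriterService code: `RequeueDecision`, `Action`, `DownstreamHttpException` and the mapping of HTTP statuses to actions are all assumptions made for the example.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical sketch: none of these names come from the TIS codebase.
final class RequeueDecision {

  /** What the consumer should do with a message whose processing failed. */
  enum Action { REQUEUE, DEAD_LETTER }

  /** Illustrative exception carrying an HTTP error status from the downstream service. */
  static final class DownstreamHttpException extends RuntimeException {
    final int status;

    DownstreamHttpException(int status) {
      super("Downstream responded with HTTP " + status);
      this.status = status;
    }
  }

  static Action decide(Throwable error) {
    // Connection-level failures: the downstream service never handled the
    // request, so the same message may well succeed later.
    if (error instanceof ConnectException || error instanceof SocketTimeoutException) {
      return Action.REQUEUE;
    }
    if (error instanceof DownstreamHttpException) {
      int status = ((DownstreamHttpException) error).status;
      // Assumption: 5xx means the service is temporarily unhealthy, while
      // other statuses mean the request itself was rejected.
      return status >= 500 ? Action.REQUEUE : Action.DEAD_LETTER;
    }
    // Anything else (for example an unparseable payload) is treated as a
    // problem with the message, which retrying would not fix.
    return Action.DEAD_LETTER;
  }

  public static void main(String[] args) {
    System.out.println(decide(new ConnectException("Connection refused")));
    System.out.println(decide(new DownstreamHttpException(503)));
    System.out.println(decide(new DownstreamHttpException(400)));
    System.out.println(decide(new IllegalArgumentException("Unparseable payload")));
  }
}
```

In this sketch an unavailable TCS would lead to a requeue rather than a discarded message, while a genuinely bad message still goes to the Dead Letter Queue.
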
Action Items

| Action Items | Owner | Status |
|---|---|---|
| Fix current Placements whose status is currently inaccurate | Edward Barclay | ongoing, https://hee-tis.atlassian.net/browse/TIS21-1650 |
| Make the Inbound Data Writer service more resilient so it requeues the messages when TCS doesn’t respond | Marcello Fabbri (Unlicensed) | done, https://hee-tis.atlassian.net/browse/TIS21-1651 |
| Check elsewhere in the ESR interface for places where requeuing would be appropriate | Marcello Fabbri (Unlicensed) | ongoing, https://hee-tis.atlassian.net/browse/TIS21-1746 |

...

Lessons Learned

  • Consider more carefully when it’s appropriate to requeue a message (re-attempt processing it) and when it’s ok not to requeue a message.
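
Where re-attempting is appropriate, one way to keep the number of attempts reasonable is a bounded retry with a growing delay, so that a long outage still ends with the failure being surfaced (for example via the Dead Letter Queue). The sketch below is an illustration under stated assumptions, not the fix that was deployed: `BoundedRetry`, `callWithRetry` and its parameters are hypothetical names.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: names and parameters are illustrative only.
final class BoundedRetry {

  /**
   * Runs the call up to maxAttempts times, doubling the delay after each
   * failure. If every attempt fails, the last exception is rethrown so the
   * caller can decide what to do with the message.
   */
  static <T> T callWithRetry(Callable<T> call, int maxAttempts, long initialDelayMs)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delayMs = initialDelayMs;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          // Wait before the next attempt, then back off further.
          Thread.sleep(delayMs);
          delayMs *= 2;
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulates a service that is unavailable for the first two calls.
    String result = callWithRetry(() -> {
      if (++calls[0] < 3) {
        throw new java.io.IOException("service unavailable");
      }
      return "updated";
    }, 5, 10L);
    System.out.println(result + " after " + calls[0] + " calls");
  }
}
```

A helper like this would only wrap calls whose failures are judged to be temporary. Failures caused by the message itself would skip the retry entirely.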