The moon on a stick

The development team came together on 19th August, after the first refinement session, to come up with ideas on how to re-architect the system in terms of its integration with ESR.

We started off with a list of good and bad things about the current implementation (this was used again later, after the design session).

We came up with the following list (it's a little unclear, so it's typed out too).

The Good

  • It works
  • Meets requirements
  • Confirmation files
  • Logging in terms of data (we keep the CSV files)
  • Extensive documentation

The Bad

  • Many points of failure in the infrastructure
  • CSV files
  • Scheduled jobs / bulk processing
  • Interdependent jobs
  • Not resilient
  • Reporting of errors is non-existent
  • FTP
  • Bad / slow feedback loop
  • Difficult to debug
  • Logging of personal details / GDPR issues?
  • We don't have N3 access
  • File structure of DAT files (no schema validation)
  • Documentation is incorrect
  • Resource intensive
  • Cannot deal with large data sets
  • Complex code / bad config
  • Complex directory structure in AZ
  • Jobs run out of hours
  • Not real time

Scenarios

With this in mind, we then thought about what could be done on this project. We came up with a spectrum: at one end, everything can change (bring in new technology/languages/frameworks) with new designs and rewrites; at the other end, no ESR interface changes at all, with all the work being done by the TIS team. This then led to the more likely middle ground where we think we'll end up.

The most likely scenario, we thought, was an integration layer: the TIS team would rewrite some or all of the ESR code/services while building an integration layer that translates the ESR interface into something that's much nicer to use.

The Design

So the following is what was created:

We started out treating the whole project as a blank canvas, where we could start fresh with any ideas. Once that was done, we scaled it back to a system built around an integration layer.

On the left is TIS, with its various services (TCS, ESR etc.). We then conceptualised that we'd receive data in some form (whether files, requests, remote calls etc.) on the right.

We had a question on whether the services in TIS were required to be up (running) 100% of the time, as the answer would affect the design. The answer was that TIS is expected to be working between the hours of 8am and 6pm, and that any downtime outside of these hours will be dealt with during working hours. This shaped the requirements such that we need a system to "buffer" or "queue" inbound data. A number of systems were suggested (ActiveMQ, RabbitMQ, Kafka, JMS, a DB).

With a system now defined to segregate TIS and ESR, focus turned to what form this data will take and how to process it. We came up with one or more consumers reading off the queue, with each message being as small and atomic as possible. This helps both with handling large amounts of data (as there will be the ability to "fan out" consumers) and with dealing with problematic records, as you'll be able to fail a single record rather than a whole file.

With this consumer reading data from the queue, and possibly making REST calls to other TIS services, it should have enough data to process the message. Once processed, it will push the processed records back onto the queue for ESR to pick up.
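As a rough sketch of that consumer idea (names, record shapes and the enrichment step are ours, not ESR's; a real implementation would use one of the brokers listed above rather than an in-memory queue):

```python
import json
import queue

# Hypothetical names throughout: the real broker, endpoints and record
# shapes are still to be decided.
inbound = queue.Queue()   # stands in for the inbound broker queue
outbound = queue.Queue()  # stands in for the queue ESR reads from

def enrich(record):
    """Stand-in for REST calls to other TIS services (e.g. TCS)."""
    record["postDetails"] = {"site": "placeholder"}
    return record

def consume_one():
    """Process a single atomic record; a failure affects only this record."""
    raw = inbound.get()
    try:
        processed = enrich(json.loads(raw))
        outbound.put(json.dumps(processed))
    except Exception:
        # With a real broker this would be a nack / dead-letter, not a drop.
        pass

inbound.put(json.dumps({"placementId": 1}))
consume_one()
result = json.loads(outbound.get())
```

Because each message is atomic, scaling out is just running `consume_one` in more worker processes against the same queue.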

Two other areas were identified that will need to process and push data onto the queue: scheduled, time-based jobs that need to occur on a daily basis (e.g. the notification job triggered by Jenkins), and actions made by users in TIS that relate to Posts and Placements. These events could push messages onto a queue describing what changed, with other consumers picking them up for processing, or the processing could happen within those services with the result pushed onto the queue for ESR to pick up.
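One possible shape for such a change event (field names are illustrative only; no event schema has been agreed):

```python
import json
from datetime import datetime, timezone

# Illustrative only: the real event schema has not been agreed.
def placement_changed_event(placement_id, changed_fields):
    """Build a small change event describing a user action on a Placement."""
    return {
        "eventType": "PLACEMENT_UPDATED",
        "entityId": placement_id,
        "changedFields": changed_fields,  # only the deltas, keeping messages small
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "source": "tcs",
    }

event = placement_changed_event(42, {"grade": "ST3"})
payload = json.dumps(event)  # what would be placed on the queue
```

Carrying only the changed fields keeps messages atomic, in line with the small-record approach above.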

Focus then moved to auditing, monitoring, debugging and security. We came up with the concept of saving all inbound and outbound data, so that we have a single place to validate what was sent to and from TIS. This will make conversations with ESR easier, as we will have the facts easily available via a frontend that pulls this information.

Integration Layer

With this designed, we thought about the most likely scenario: that we will still need to build a system that deals with files rather than messages. This introduces a few more services that sit on the other side of the queue, reading the files and splitting them before placing the data on the queue. Another service will then spin up at the appropriate times to consume all the messages in the queue destined for ESR and write them to a file for ESR to pick up. This layer is the piece that can be removed in the future, when ESR finally catches up to the world of API consumption.
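A minimal sketch of the splitting side, assuming a line-per-record file (real parsing would follow the ESR DAT file spec, which this deliberately glosses over):

```python
import queue

esr_queue = queue.Queue()  # stands in for the real broker

def split_file(lines):
    """Turn one inbound file into many atomic queue messages.

    Each non-empty line becomes its own message, so one bad record
    can be failed without rejecting the whole file.
    """
    count = 0
    for line_no, line in enumerate(lines, start=1):
        record = line.strip()
        if not record:
            continue
        esr_queue.put({"lineNo": line_no, "raw": record})
        count += 1
    return count

# A pretend two-record file with a trailing blank line.
n = split_file(["REC1|fields", "REC2|fields", ""])
```

Keeping the original line number on each message preserves traceability back to the source file when a record fails.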

Dual Runs

For the sake of testing and validation, we can run this new system alongside the existing system, but have the outputs placed in a different store so that we can compare and contrast them.
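The comparison step could be as simple as diffing the two output stores record by record, assuming records can be keyed (the keys and values here are made up):

```python
def diff_outputs(legacy, new):
    """Compare legacy and new pipeline outputs, keyed by record id.

    Returns ids only in one system, plus shared ids whose values differ.
    """
    only_legacy = sorted(set(legacy) - set(new))
    only_new = sorted(set(new) - set(legacy))
    mismatched = {k: (legacy[k], new[k])
                  for k in set(legacy) & set(new)
                  if legacy[k] != new[k]}
    return only_legacy, only_new, mismatched

# Pretend outputs from a dual run.
legacy = {"P1": "A", "P2": "B"}
new = {"P1": "A", "P2": "C", "P3": "D"}
only_legacy, only_new, mismatched = diff_outputs(legacy, new)
```

An empty diff across a few dual runs would give us confidence to cut over.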

Questions

There were a few good questions that arose that we couldn't answer right away:

  • The data sent from ESR: are there any dependencies between records?
    • Do we need to process them in a FIFO manner?
  • Should we use this opportunity to split TCS a little and take Posts and Placements out into their own service, so that we don't stress TCS too much?
  • How should we deal with failover?
  • How large a queue do we need?
    • How big is the inbound and outbound data?
  • If we're dealing with files, do we care about security in terms of viruses?
  • JSON datasets and JSON Schema: can we get ESR to validate inbound data?
  • Idempotence: should we implement it?
  • Should we implement correlation IDs?
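To illustrate the last two questions together, a sketch of an idempotent consumer that uses a correlation ID to deduplicate (the ID format and the in-memory store are placeholders):

```python
processed_ids = set()  # in production this would be a durable store

def handle(message):
    """Process a message at most once, keyed by its correlation ID.

    The same ID also lets a record be traced end to end across
    TIS, the queue and ESR when debugging.
    """
    cid = message["correlationId"]
    if cid in processed_ids:
        return "skipped"  # duplicate delivery: safe to ignore
    processed_ids.add(cid)
    # ...actual processing would happen here...
    return "processed"

first = handle({"correlationId": "abc-123"})
second = handle({"correlationId": "abc-123"})  # redelivery of the same message
```

This matters because most brokers guarantee at-least-once delivery, so duplicates will happen.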

Reflection

Once we had some form of design, we looked back at the Good/Bad list and went through the Bads to see whether we would fix these issues. We concluded that most of them would be fixed, and the rest we would not make any worse.

Risks

There are a number of risks that we may need to address with this architecture:

  • Increased network traffic
  • Smaller record processing increases load on the DB
  • Increased cost for the messaging system
  • New technology that the team will need to learn
  • Storage costs for auditing if it isn't handled properly
  • ESR probably can't deal with duplicates, which can happen if a placement is updated multiple times

A whole new world

The following diagram depicts what we expect ESR will be able to do with minimal work. It still works with files, but at least we'll be able to remove all of the scheduled jobs and FTP scripts.

The ideal scenario here would be that ESR is able to create messages and place them on the queue itself. This would negate the need for an integration layer, and therefore any feedback would be almost instantaneous, rather than waiting until the next scheduled run to see the results. It would also remove all the complexity of moving files around, along with concerns about the network connectivity of N3.

The following are other scenarios that sit at the nicer end of the previously mentioned spectrum:

API based approach

Direct Messaging Approach

Synchronous API Approach