2022-12-13: Hack day – Monitoring and Alerting

We decided to hold hack days on a more regular six-week cycle. For this hack day we opted for a remote event, tackling a tech improvement challenge with a single shared Discovery followed by a two-team split, and following the GDS Agile phases approach that worked well last hack day.

General benefits of a hack-day:

  • practice collaboration outside the normal group of people you collaborate with

  • test your specialist skills, or, conversely, test your non-specialist skills

  • challenge a group of people that may not always work together to plan a time-bound task

  • maintain focus on one task for a whole day

Topic selection process

We came to organising this hack day quite late. As a result, we looked at our backlog and selected an investigation-type ticket that would involve the whole team.

@Andy Dingley , @John Simmons (Unlicensed) and @Cai Willis assisted @Andy Nash (Unlicensed) in organising this one.

We decided to investigate the longer-term approach to monitoring and alerting because it needs doing and has both an obvious technical element and a more business-focused element: the wording of the alerts and the onward comms.

Why did we choose monitoring and alerting as a tech improvement hack day?

We currently have monitoring and alerting offered by a range of tools / tool combinations that have evolved over time.

We have recently been looking at streamlining our costs where possible, making it easier for everyone in the team to help out wherever they can, and reducing our security risk overall.

We also identified that current alerting is not pitched at non-techies, and even for those who are technical, the alerts are sometimes impenetrable. There is therefore a need to improve alerts so they describe the impact on users and, especially, give an indication of how to resolve the problem.

Team members

The TIS Team broke into two teams:

One team looked specifically at an AWS Native approach to monitoring and alerting (CloudWatch, EventBridge, Lambdas).

The other team looked specifically at an AWS Managed approach (Prometheus, Grafana, etc.).
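To make the AWS Native option more concrete, here is a minimal sketch of the kind of thing that path involves: a CloudWatch alarm on a sample metric that notifies an SNS topic when breached. The metric name, namespace, alarm name and topic ARN are illustrative placeholders, not decisions made on the day.

```python
# Minimal sketch of the AWS Native route: a CloudWatch alarm that notifies
# an SNS topic when a sample metric breaches its threshold.
# The metric, namespace, alarm name and topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="tis-overnight-sync-failures",  # hypothetical alarm name
    AlarmDescription=(
        "Overnight sync job reported failures. "
        "Impact: data shown to users may be stale. "
        "Suggested fix: re-run the sync job and check the job logs."
    ),
    Namespace="TIS/SyncJobs",     # hypothetical custom namespace
    MetricName="FailedSyncRuns",  # hypothetical custom metric
    Statistic="Sum",
    Period=3600,                  # evaluate hourly
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-2:123456789012:tis-alerts"],  # placeholder topic
)
```

An SNS subscription (for example, a small Lambda posting into Slack) would then handle the onward message, which is where the alert-wording work discussed below comes in.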

Subject matter experts

During discovery, we identified 4 main user groups:

  1. TIS Team ourselves (Problem Manager especially);

  2. Users of products affected by any issues (in all cases, it was felt that onward comms should be handled by Local Office administrators - as they have existing relationships with the end users);

  3. TIS leads (for the Products they work on); and

  4. Data Leads (where data is affected - notably stale data).

AWS Native (CloudWatch, EventBridge, Lambdas):

  • AdeO 

  • Andy D 

  • Ash 

  • Cai 

  • Doris 

  • Ed 

  • James 

  • Kav 

  • Rob 

  • Steve 

  • Ade B 

AWS Managed (Prometheus, Grafana, etc):

  • Ade A 

  • Anita 

  • Catherine 

  • Jay 

  • John S 

  • JohnO 

  • Naz 

  • Pepe 

  • Reuben 

  • Saedhia 

  • Stan 

Structure of the day

  • 1 hour: discovery (who are the users? what are their problems (might be different problems for different users)? what are the risks and assumptions we need to shine a light on before committing to potentially wasted development effort?)

  • 1 hour: alpha experiments (testing assumptions and risks and narrowing down a range of possible solutions, to the preferred one to develop in beta)

  • 1 hour: lunch

  • 3 hours (incl. breaks): beta development (incrementing a solution - developing a bit, reconvening, developing a bit more, reconvening. Avoiding long periods of silo/solo working)

  • 1/2 hour: presentations back to each other

Discovery / Alpha

Discussion summary

The purpose of monitoring and alerting is

  1. to know when things have gone wrong

  2. to alert relevant people as soon as possible

  3. to try to give us as much notice as possible to fix things before they affect users

There’s a question of effort vs impact. If an issue has little to no impact on users, we shouldn’t put too much of our time into alerting PMs and users about it. But for issues that do significantly impact users, we need to prioritise managing their expectations of how the issue is being handled (which also reduces support requests coming through to us).

User groups

  • Primary users: TIS Team

  • Users of products affected by any issues, via Local Office administrators

  • TIS Leads (for the Products they lead on)

  • Data Leads (where data is affected - especially stale data)

Risks and assumptions (initial stab at the MVP)

  • The ideal is that, for issues that can’t be auto-fixed, whoever sees the alert can understand it, fix the problem, and handle onward comms.

  • Test out with whoever is on the call which alert messaging is clear enough.

  • What do the alerts look like? In Slack, we can prototype them with Block Kit Builder (see the sketch after this list).

  • Are there existing examples of what we would regard as ‘good/bad practice’?

  • Can we categorise types of alerts, by intended user audiences?

  • We need to understand the strengths and weaknesses of each approach; use sample issues as a way of testing this, e.g. the app is down, an overnight sync job has failed, resource usage is nearing its limit.
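As a rough sketch of how an alert pitched at non-technical readers might look in Slack, the snippet below posts a Block Kit message to an incoming webhook. The webhook URL, alert title and wording are made-up examples for discussion, not something built on the day.

```python
# Rough sketch of a Slack alert built with Block Kit and posted to an
# incoming webhook. The webhook URL and wording are made-up examples.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

alert = {
    "blocks": [
        {
            "type": "header",
            "text": {"type": "plain_text", "text": "Overnight sync job failed"},
        },
        {
            "type": "section",
            "fields": [
                {
                    "type": "mrkdwn",
                    "text": "*Who is affected:*\nUsers may see stale data until the sync re-runs.",
                },
                {
                    "type": "mrkdwn",
                    "text": "*What to do:*\nRe-run the sync job and check the job logs; escalate to the TIS team if it fails again.",
                },
            ],
        },
        {
            "type": "context",
            "elements": [
                {"type": "mrkdwn", "text": "Severity: medium | Source: overnight sync monitor"}
            ],
        },
    ]
}

response = requests.post(SLACK_WEBHOOK_URL, json=alert, timeout=10)
response.raise_for_status()
```

The point of the example is the wording: each alert states who is affected and what to do next, rather than just dumping a technical error.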

 

Pull out any questions we have for users so we can check with them at a later date.

Hack day Retro

Hack day Retros are best carried out at the end of a Hack day, while everything is still fresh in everyone’s minds. However, we felt that would negatively impact the ability to develop something on the day, so we agreed to hold the Retro a couple of days after the hack day - giving everyone a chance to reflect and to inspect the outputs from each team, too.

Retro summary

Start

Continue

Do differently

Stop

What tech stack do we feel we should go with?

  • Follow up on hack day work

  • Get SMEs in where possible

  • Avoid working on backlog tickets other than as spikes / investigations

  • Following discovery, pitch ideas for teams to tackle

  • Pre-discovery so everyone is up to speed at the start of the hack day

  • Work to the strengths of the team you have

  • Understand current as-is where appropriate

  • Involve everyone

  • Prizes are great motivators

  • Consider ice-breakers / warm up exercises

  • Using Service Standard Agile Phases (explain where necessary)

  • Not being scared to structure the day

  • Joint discovery and split teams for alpha / beta (for Hack day topics that merit shared understanding of the problem space)

  • Picking the topic before the day (but encouraging people to wait till after Discovery on the day to start thinking about solutions!)

  • Mixing up the teams

  • Solving real problems

  • Clearly work out what the problem is (during Discovery). Using double diamond perhaps to go broad and then narrow the focus. Also have different blueprints perhaps for different kinds of hack day (tech improvement, product development, team collaboration, etc)

  • Pre-Hack day prep-work. Include a clear focus for the day (tech improvement, product development, team collaboration, etc)

  • Being too ambitious. Consider the double diamond technique.

  • Focus on achievables, and stretch targets

  • Moving between Agile phases without a clear assessment of the phase you’re in (including, don’t forget, the possibility of ditching an idea from one phase to the next)

  • Having groups that are too large. Each team on this hack day was 11 people; that is at least 50% too many, and possibly twice the ideal size.

  • Having to procure our own pizzas - hack day budget!

There are a lot of questions around Prometheus and Grafana. Their usability is limited by our current need to go through Trustmarque. Rob indicated that changing this would require a complete re-procurement of AWS - a large piece of work.

There wasn’t enough that came out of the day to conclusively say one stack was better than the other. However, when asked their preference, those who gave one (six people) all said AWS Native felt like the better option.

Actions

https://hee-tis.atlassian.net/browse/TIS21-3986