2022-12-13: Hack day – Monitoring and Alerting
We decided to hold hack days on a more regular six-week cycle. For this hack day we decided to hold a remote event, look at a tech improvement challenge using a single Discovery and a two-team approach, and follow the GDS Agile phases approach that worked well last hack day.
General benefits of a hack day:
- practice collaboration outside the normal group of people you collaborate with
- test your specialist skills, or, conversely, test your non-specialist skills
- challenge a group of people that may not always work together to plan a time-bound task
- maintain focus on one task for a whole day
Topic selection process

We came to organising this hack day quite late. As a result, we decided to look at our backlog and select an investigation-type ticket that would involve the whole team. @Andy Dingley, @John Simmons (Deactivated) and @Cai Willis assisted @Andy Nash (Unlicensed) in organising this one. We decided to investigate the longer-term approach to Monitoring and alerting because it needs doing and has both an obvious technical element and a more business-focused element around the wording of the alerts and onward comms.

Why did we choose Monitoring and alerting as a Tech improvement hack day?

We currently have monitoring and alerting offered by a range of tools and tool combinations that have evolved over time. We have recently been looking at streamlining our costs where possible, making it easier for everyone in the team to help out with whatever they can, and reducing our overall security risk. We also identified that current alerting is not pitched at non-techies, and even for those who are technical the alerts are sometimes impenetrable. So there is a need to improve the alerting, especially in terms of the impact on users and an indication of how to resolve the problem.
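As a rough illustration of what "alerting pitched at non-techies" could look like in practice, the sketch below is a hypothetical AWS Lambda handler (not something we have built) that takes the JSON a CloudWatch alarm publishes to an SNS topic and reposts it to a Slack webhook in plainer English, covering the impact on users and a suggested first step. The webhook environment variable, the alarm names and the runbook mapping are all illustrative assumptions.

```python
# Hypothetical sketch: translate a CloudWatch alarm notification into a
# plain-English Slack message. The alarm names, webhook variable and runbook
# mapping are illustrative assumptions, not part of our current setup.
import json
import os
import urllib.request

# Assumed mapping from alarm name to (user impact, suggested first step).
RUNBOOK = {
    "tis-sync-job-failed": (
        "Trainee records may be out of date until the sync re-runs.",
        "Re-run the sync job; escalate to the dev on support if it fails again.",
    ),
}


def handler(event, context):
    """Lambda entry point, subscribed to the SNS topic the alarms publish to."""
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        name = alarm.get("AlarmName", "unknown alarm")
        impact, fix = RUNBOOK.get(
            name, ("Impact not yet documented.", "See the alarm description.")
        )

        text = (
            f"{name} is now {alarm.get('NewStateValue')}\n"
            f"What this means for users: {impact}\n"
            f"Suggested first step: {fix}\n"
            f"Technical detail: {alarm.get('NewStateReason')}"
        )

        # Post to a Slack incoming webhook (URL supplied via an env var).
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```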
Team members

The TIS team broke into two teams:
- One team looked specifically at an AWS Native approach to monitoring and logging.
- The other team looked specifically at an AWS Managed approach.
Subject matter experts

During discovery, we identified 4 main user groups:

AWS Native (CloudWatch, EventBridge, Lambdas):

AWS Managed (Prometheus, Grafana, etc.):
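To make the two options more concrete: the AWS Native flavour broadly means wiring alerting together from AWS primitives. The sketch below is a hedged boto3 illustration of that flavour, a CloudWatch alarm plus an EventBridge rule routing alarm state changes to a Lambda; every name, metric, threshold and ARN is a placeholder assumption rather than our actual configuration.

```python
# Hypothetical boto3 sketch of the "AWS Native" approach: a CloudWatch alarm
# plus an EventBridge rule that routes alarm state changes to a Lambda.
# All names, metrics, thresholds and ARNs are illustrative placeholders.
import json
import boto3

cloudwatch = boto3.client("cloudwatch")
events = boto3.client("events")

# Alarm on a placeholder metric (5xx responses from a load balancer).
cloudwatch.put_metric_alarm(
    AlarmName="tis-api-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)

# EventBridge rule matching CloudWatch alarm state changes...
events.put_rule(
    Name="tis-alarm-state-change",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
    }),
    State="ENABLED",
)

# ...forwarded to a placeholder Lambda that formats and routes the alert.
events.put_targets(
    Rule="tis-alarm-state-change",
    Targets=[{
        "Id": "alert-formatter",
        "Arn": "arn:aws:lambda:eu-west-2:123456789012:function:alert-formatter",
    }],
)
```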
Structure of the day
Discovery / Alpha

Discussion summary

The purpose of monitoring and alerting is:

There's a question of effort vs impact. If there is little to no impact on users, we shouldn't put too great a percentage of our time into alerting PMs and users of these things. But for things that do significantly impact users, we need to prioritise managing their expectations of how the issue is being handled (which also reduces the support requests coming through to us).

User groups
Risks and assumptions (initial stab at the MVP)
Pull out any questions we have for users, to check with them when we can at a later date.
Hack day Retro

Hack day Retros are best carried out at the end of a Hack day, while everything is still fresh in everyone's minds. We felt that would negatively impact the ability to develop something on the day, so we agreed to hold the Retro a couple of days after the hack day, giving everyone a chance to reflect and to inspect the outputs from each team.
Retro summary
Start | Continue | Do differently | Stop | What tech stack do we feel we should go with?
---|---|---|---|---
 | | | | There are a lot of questions around Prometheus and Grafana: their usability is limited by our current need to go through Trustmarque, and Rob indicated that changing this would require a complete re-procurement of AWS, which is a large piece of work. There wasn't enough that came out of the day to conclusively say one stack was better than the other. However, when asked their preference, those who gave one (6 people) all said AWS Native felt the better option.
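For comparison, the AWS Managed flavour would mean expressing the same kind of alert as a Prometheus rule pushed into an Amazon Managed Service for Prometheus workspace (with Grafana on top for dashboards). The sketch below is only an illustration of that shape: the workspace ID, rule name, metric and threshold are made-up assumptions, and it presumes a workspace already exists.

```python
# Hypothetical sketch of the "AWS Managed" approach: push a Prometheus
# alerting rule into an Amazon Managed Service for Prometheus workspace.
# The workspace ID, rule name, metric and threshold are placeholders.
import boto3

RULES_YAML = b"""
groups:
  - name: tis-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) > 0.1
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "5xx error rate above 0.1 requests/second for 5 minutes"
          impact: "Users may see failed requests"
"""

amp = boto3.client("amp")
amp.create_rule_groups_namespace(
    workspaceId="ws-0123456789abcdef",  # placeholder workspace ID
    name="tis-alerts",
    data=RULES_YAML,
)
```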
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213