We decided to hold hack days on a more regular 6-week cycle. For this hack day we decided to hold a remote event built around a tech improvement challenge, with one shared Discovery and a two-team approach, and to follow the GDS Agile Delivery framework phases, which worked well at the last hack day.
General benefits of a hack-day:
practice collaboration outside the normal group of people you collaborate with
test your specialist skills, or, conversely, test your non-specialist skills
challenge a group of people that may not always work together to plan a time-bound task
maintain focus on one task for a whole day
Topic selection process
We came to organising this hack day quite late. As a result, we looked at our backlog and selected an investigation-type ticket that would involve the whole team. Andy Dingley, John Simmons and Cai Willis assisted Andy Nash in organising this one. We decided to investigate the longer-term approach to monitoring and alerting because it needs doing and has both an obvious technical element and a more business-focused element: the wording of the alerts and onward comms.
Why did we choose Monitoring and alerting as a Tech improvement Hack day?
We currently have monitoring and alerting offered by a range of tools and tool combinations that have evolved over time. We have recently been looking at streamlining our costs where possible, making it easier for everyone in the team to help out wherever they can, and reducing our overall security risk. We also identified that current alerting is not pitched at non-techies, and even for the techies the alerts are sometimes impenetrable. So there is a need to improve the alerting, especially in describing the impact on users and indicating how to resolve the problem.
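As a rough illustration of the kind of alert described above, the sketch below renders an alert that leads with a plain-language user impact and a suggested next step, and keeps the raw technical detail last. The `Alert` class, its field names and the example values are all hypothetical and are not taken from any existing TIS tooling.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert shape; every field name here is an assumption."""
    service: str
    technical_detail: str  # raw message, mainly useful to engineers
    user_impact: str       # plain-language effect on users
    how_to_resolve: str    # suggested first step towards a fix


def render_alert(alert: Alert) -> str:
    """Put the user impact and next step first, and the raw detail last."""
    return "\n".join([
        f"Service: {alert.service}",
        f"Impact on users: {alert.user_impact}",
        f"Suggested next step: {alert.how_to_resolve}",
        f"Technical detail: {alert.technical_detail}",
    ])


# Made-up example values, for illustration only.
example = Alert(
    service="example-service",
    technical_detail="HTTP 503 from health check",
    user_impact="Users cannot log in",
    how_to_resolve="Check whether the service restarted cleanly",
)
print(render_alert(example))
```

Ordering the fields this way is one possible design choice: a non-technical reader can stop after the first three lines, while engineers still get the original message.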
Team members
The TIS Team broke into two teams. One team looked specifically at an AWS Native approach to monitoring and logging. The other team looked specifically at an AWS Managed approach.
Subject matter experts
During discovery, we identified 4 main user groups:
AWS Native (CloudWatch, EventBridge, Lambdas):
AWS Managed (Prometheus, Grafana, etc.):
Structure of the day
Discovery / Alpha
Discussion summary
The purpose of monitoring and alerting is
There is a question of effort versus impact. If an issue has little to no impact on users, we shouldn’t spend too great a share of our time alerting PMs and users about it. For issues that do significantly impact users, we need to prioritise managing their expectations of how the issue is being handled, which should also reduce the support requests coming through to us.
User groups
Risks and assumptions (initial stab at the MVP)
Pull out any questions we have for users, so that we can check with them when we can at a later date.
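The effort-versus-impact point from the discussion summary could be sketched as a small routing rule: low-impact issues stay with the team, and only significant ones are communicated to PMs and users. The three impact levels, the audience names and the assumption that engineers always see every alert are hypothetical, chosen only to make the idea concrete.

```python
from enum import Enum


class Impact(Enum):
    """Hypothetical impact levels; the source does not define a scale."""
    NONE = 0
    MINOR = 1
    SIGNIFICANT = 2


def audiences_for(impact: Impact) -> list[str]:
    """Return who should hear about an issue, given its impact on users."""
    # Assumption: the engineering team always sees the alert.
    audiences = ["engineers"]
    # Only significant user impact justifies the effort of wider comms.
    if impact is Impact.SIGNIFICANT:
        audiences += ["product managers", "users"]
    return audiences
```

For example, `audiences_for(Impact.MINOR)` returns only the engineering audience, while `audiences_for(Impact.SIGNIFICANT)` also includes product managers and users.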
Hack day Retro
Hack day Retros are best carried out at the end of a Hack day, while everything is still fresh in everyone’s minds. Because of time and teleconferencing constraints, this was unfortunately not possible. Many participants were sick or on leave in the following few days.
Retro summary
Actions