...
- 1. Notify the incident originator (usually via Teams, or via Naz or Rob) that you're looking into it. As first responder, take on the ‘coordinator’ role, or suggest who should. The coordinator should not get sucked into fixing the problem (unless they’re the only person around when it’s noticed); instead, they focus on coordinating the effort and the comms between teammates working on the issue, and out to Naz, Rob, the incident reporter, users and beyond.
- From the first report of a live defect, it is paramount that the DM/PM delegates the coordinator role, to ensure all the steps throughout this process are properly followed.
- 2. In Slack, send a `/fire-fire` message, which will create a dedicated #fire-fire-[yyyy-mm-dd] channel and notify the team 👩🚒👨🚒🚒. Then create a LiveDefect Jira ticket and a Confluence incident log page. (Note: we’re looking at automating the creation of these three things with one action in future - watch this space.) To keep the team focused, the DM can nominate themselves to create the Jira ticket and the incident log page; where the DM and PO are both unavailable, the team is a ‘self-organising’ team and is expected to agree amongst themselves who will do those two tasks (ticket and incident log page). If the incident is data related, ensure you involve someone on the data side of things - James, BAs, data analysts - so they can help assess impact and run any reports to that effect asap, before the incident is fixed and retrospective reporting on impact becomes much more difficult.
- 3. Use the incident log page to start recording a timeline of events (first noticed, first alerted, initial actions taken, results, etc.). In the middle of a #fire-fire, the temptation is to quickly fix things without making others aware of what you're doing; when these quick fixes don't work, they can seriously muddy the waters of the #fire-fire issue.
- 4. Ask for assistance (who, specifically, do you need to help; or does this need an 'all hands to the pump' approach?)
- 5. Look at logs (Depends on issue)
- 6. Determine the likely causes (see table below as a starter)
- 7. Nothing obvious? Who can, or needs to, help with the root cause analysis (RCA)? It is worth revisiting your description of the problem to ensure all the obvious questions have been asked, then revisit the RCA, making sure you have the right expertise within the team for a successful outcome.
- 8. Who needs informing (within the team and without)? How and when do they need informing (e.g. Slack, Email, MS Teams, by phone | e.g. As soon as the problem is noticed / as soon as the team starts responding / whenever there's significant progress on a resolution / when the team needs assistance from outside the team / when the problem is confirmed as resolved).
- 9. Once a quick fix is in place, do you need to ticket up any longer term work to prevent the problem in future (bear in mind, whoever leads on addressing the incident is probably now best placed to determine the priority of such longer-term fix tickets)
- 10. If anyone feels responsible for the incident, we are working on a virtual chicken hat for them to wear! 🐔 (For the uninitiated: we had a chicken hat in the office. Anyone responsible for a problem on Prod ceremonially wore the chicken hat as a badge of ‘honour’ for the day, got pride of place on the #wall-of-failure, and was thanked by everyone else for ‘taking one for the team’ - we all learn best from our mistakes!)
- 11. Active engagement of the PM/DM in supporting the coordinator role is essential: it helps the team manage the live defects process to an effective outcome, and improves the process for managing any future live defect.
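Step 2 mentions we may automate creating the Slack channel, Jira ticket and Confluence page with one action. A minimal sketch of what that could look like, using Slack's `conversations.create` Web API method and the standard Jira/Confluence REST APIs; the tokens, base URLs, project key `TIS` and space key `LIVE` are illustrative assumptions, not our real configuration:

```shell
#!/bin/sh
# Sketch only: $SLACK_TOKEN, $JIRA_TOKEN, $CONFLUENCE_TOKEN, the base URLs,
# project key "TIS" and space key "LIVE" are placeholder assumptions.

# Channel name follows the #fire-fire-[yyyy-mm-dd] convention.
fire_channel_name() {
  printf 'fire-fire-%s' "$(date +%Y-%m-%d)"
}

# 1. Create the dedicated Slack channel.
create_channel() {
  curl -sf -X POST 'https://slack.com/api/conversations.create' \
    -H "Authorization: Bearer $SLACK_TOKEN" \
    -d "name=$(fire_channel_name)"
}

# 2. Create the LiveDefect Jira ticket.
create_ticket() {
  curl -sf -X POST "$JIRA_BASE_URL/rest/api/2/issue" \
    -H "Authorization: Bearer $JIRA_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{"fields": {"project": {"key": "TIS"},
                    "summary": "Live defect '"$(date +%Y-%m-%d)"'",
                    "issuetype": {"name": "LiveDefect"}}}'
}

# 3. Create the Confluence incident log page.
create_log_page() {
  curl -sf -X POST "$CONFLUENCE_BASE_URL/rest/api/content" \
    -H "Authorization: Bearer $CONFLUENCE_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{"type": "page", "space": {"key": "LIVE"},
         "title": "Incident log '"$(date +%Y-%m-%d)"'",
         "body": {"storage": {"value": "<p>Timeline</p>",
                              "representation": "storage"}}}'
}
```

Wiring these three calls behind a single slash command (or a small CI job) would remove the manual steps; the channel-naming helper keeps the date format consistent with the convention above.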
...
Incident | Probable causes | Actions | Risks |
---|---|---|---|
User unable to login. | User trying to access at the same time as a job running on the server, a restart, or a build taking place. | Is it a Trust user (normally clear from the message alerting us to the incident in the first place), or an HEE user? When triaging issues, Alistair and Seb will often ask incident reporters to clear their cache, reset their password, etc. as part of a series of standard actions that have historically sometimes fixed the problem (rather than distracting devs unnecessarily). | Few - users may end up with greater or lesser access than they need if you assign them more permissions. |
User creation. | HEE Admins usually resolve. Sometimes there’s a permission issue. | Essentially, point them to their LO Admin. Refer to the main Confluence page for this: Admin User Management (roles and permissions) to check whether permissions appear to have been set up as expected. | None. |
Generic upload. | Common errors have mainly been around using an old template or mis-entered dates. | Check Docker logs. Incrementally add validation rules and warnings (on the app / on the templates?) to trap these types of error. | None. |
No confirmation that the NDW Stage or Prod ETL processed successfully. | Another job failing / restarting and overlapping the schedule of the NDW ETLs. | Did another service fail / restart and interrupt one of the NDW ETLs? If so, re-run the ETL; once done, inform the NDW team on the #tis-ndw-etl Slack channel so they can initiate downstream processes. Associated note: any dev materially changing Stage needs to be aware this is likely to materially change the ETLs between TIS and NDW, so reflecting any changes to Stage within the ETLs is now part of the definition of done for any ticket that materially changes Stage. | None - restarting the ETLs will simply refresh data. No issues with doing that ad hoc outside the scheduled jobs. |
See #monitoring-esr and #monitoring-prod Slack channels. | Usually ESR sending data in the extended ASCII character set while we read it in UTF-8. | Look at the Confluence page / team sharing Phil James did on handling this type of incident before he left. Switch our services to use the extended ASCII set, if ESR can confirm this in their 000098 spec. | |
See #monitoring-prod Slack channel. | Temporary warning while a scheduled job renders the service unavailable. | Usually resolves itself. No action needed, unless the warning repeats or is not accompanied by a ‘Resolved’ pair message. Then open up the details in the Slack notification, check the logs for what's broken, and restart the service if needed. Consider stopping and starting the VM rather than restarting it (stop and start will encourage it to come back up on a new VM; restarting restarts it in-situ on the current VM, and if the current VM is the problem, you’re no better off). Failing that, call for Ops help! | None. |
See #monitoring-prod Slack channel. | Overnight sync jobs require heavy processing. | Normally comes with a ‘Resolved’ pair message very shortly afterwards. No action needed, unless you don’t see the ‘Resolved’ pair. Then check Rabbit to see whether things are running properly. | |
See #monitoring-prod Slack channel. | If not, then it’s usually an indication there are a bunch of backup files / unused Docker images that can be deleted. | Occasionally this resolves itself. An auto-delete schedule on backup files / Docker images triggered by their age? Is this possible John Simmons (Unlicensed) / Liban Hirey (Unlicensed) ? | None. |
See #monitoring-prod Slack channel. | Service ran out of memory. | Is it a Saturday? This message has appeared several Saturdays in a row. Note: there’s an in-progress query in Teams (from Mon 14 Dec) that appears to be more of an internal HEE issue to resolve, outside the TIS team (data leads). | None. |
See #monitoring Slack channel. | Someone updating a person record while this job is running. | This kind of failure sends an @channel notification. You can run the jobs from: | None. |
HEE colleagues raise issues they're having. Often their colleagues will corroborate, or resolve their problem within MS Teams. For anything else, 👉 | Data quality errors inherited from Intrepid. | Which environment are they having the problem on? See the bulk/generic upload issue above. Check with Alistair / Seb and raise it in the #fire-fire Slack channel, stating whether or not you have been able to replicate it. Create a new #fire-fire-yyyy-mm-dd channel if confirmed. | Depends on the issue. |
See #monitoring-prod Slack channel. | “PersonPlacementTrainingBodyTrustJob” fails to start for the NIMDTA Sync service, which then stops the rest of the sync jobs from starting. Happens nightly. | Manually restart the NIMDTA Sync jobs on https://nimdta.tis.nhs.uk/sync/ | None. |
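Several of the rows above end in "restart the service" or "stop and start the VM". A small sketch encoding that decision rule from the table (the docker/cloud commands in the comments are illustrative assumptions, not our actual tooling):

```shell
#!/bin/sh
# Encodes the restart guidance from the table: restart in-situ when the
# service is at fault; stop + start when the VM itself is suspected,
# because stop/start encourages the workload to come back up on a new VM.
choose_recovery() {
  # $1: what you suspect is broken - "service" or "vm"
  case "$1" in
    vm)      echo 'stop-start'  ;;  # e.g. stop the instance, then start it again
    service) echo 'restart'     ;;  # e.g. docker restart <container>, in-situ
    *)       echo 'investigate' ;;  # unclear - check the logs first
  esac
}
```

If in doubt, the table's default of checking the logs first still applies; the helper just makes the in-situ vs. fresh-VM distinction explicit.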
...