Defect management is critical for monitoring product quality throughout the product’s lifecycle. When dealing with any live defect, the process must be followed in sequence so that issues are dealt with thoroughly and the outcome is as effective as possible. The process depends on the active participation of the team working on the defect, and on everyone understanding their role at each stage of the live defect process.
When anything appears to have fallen over, we need a clear, efficient approach to feeding information back to the POs / originators of the incident. This is especially important while a skeleton crew is supporting the system over Christmas.
Sketch of a process for responding to issues:
/fire-fire
Posting this message will create a dedicated #fire-fire-[yyyy-mm-dd] channel in Slack and notify the team, create a LiveDefect Jira ticket, and create a Confluence incident log page. (Note: we’re looking at automating the creation of these three things with one action in future. Watch this space.) To keep the team focused, the DM can nominate himself/herself to create the Jira ticket and the incident log page. Where the DM and PO are both unavailable, the team is self-organising and is expected to agree amongst themselves who will do the two tasks (ticket and incident log page).

General guide to where services are hosted: What runs where?
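Until the one-action automation above exists, the dated channel name at least follows a fixed convention, so it can be generated consistently. A minimal sketch (the `fire-fire` prefix and date format come from the process above; everything else is illustrative):

```shell
# Generate today's incident channel name, matching the
# #fire-fire-[yyyy-mm-dd] convention described above.
incident_channel() {
  echo "fire-fire-$(date +%Y-%m-%d)"
}

incident_channel   # e.g. fire-fire-2020-12-14
```

The same name can then be reused verbatim as the Jira ticket reference and the Confluence page title, so all three artefacts stay linked by date.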
| Incident | Probable causes | Actions | Risks |
|---|---|---|---|
| User unable to log in. | User trying to log in while a job is running on the server, the service is being restarted, or a build is taking place. | Is it a Trust user (normally clear from the message alerting us to the incident in the first place), or an HEE user? When ‘triaging’ issues, Alistair and Seb will often ask incident reporters to clear their cache / reset their password etc. as part of a series of standard actions that have historically sometimes fixed the problem (rather than distracting devs unnecessarily). | Few. Users may end up with greater / lesser access than they need if you assign them more permissions. |
| User creation. | HEE Admins usually resolve this. Sometimes there’s a permissions issue. | Essentially, point them to their LO Admin. Refer to the main Confluence page, Admin User Management (roles and permissions), to check whether permissions appear to have been set up as expected. | None. |
| Generic upload. | Common errors have mainly been around using an old template or mis-entered dates. | Check Docker logs. Incrementally add validation rules and warnings (on the app / on the templates?) to trap these types of error. | None. |
| No confirmation that the NDW Stage or Prod ETL processed successfully. | Another job failing / restarting and overlapping the schedule of the NDW ETLs. | Did another service fail / restart and interrupt one of the NDW ETLs? If so, re-run the ETL; once done, inform the NDW team on the #tis-ndw-etl Slack channel so they can initiate downstream processes. Associated note: any dev materially changing Stage needs to be aware this is likely to materially change the ETLs between TIS and NDW. Reflecting any changes to Stage within the ETLs is therefore now part of the definition of done for any ticket that materially changes Stage. | None. Restarting the ETLs will simply refresh the data; there are no issues with doing that ad hoc outside the scheduled jobs. |
| See #monitoring-esr and #monitoring-prod Slack channels. | Usually ESR sending data in an extended ASCII character set while we read it as UTF-8. | Look at the Confluence page / team sharing Phil James did on handling this type of incident before he left. Switch our services to use the extended ASCII set, if ESR can confirm it in their 000098 spec. | |
| See #monitoring-prod Slack channel. | Temporary warning: a scheduled job is usually rendering the service unavailable. | Usually resolves itself. No action needed, unless the warning repeats or is not accompanied by a paired ‘Resolved’ message. If so, open the details in the Slack notification, check the logs for what’s broken, and restart the service if needed. Consider stopping and starting the VM rather than restarting it: a stop/start will usually bring it back up on a new VM, whereas a restart restarts it in situ on the current VM, so if the current VM is the problem you’re no better off. Failing that, call for Ops help! | None. |
| See #monitoring-prod Slack channel. | Overnight sync jobs require heavy processing. | Normally comes with a paired ‘Resolved’ message very shortly afterwards. No action needed, unless you don’t see the ‘Resolved’ pair. Then check Rabbit to see whether things are running properly. | |
| See #monitoring-prod Slack channel. | Usually an indication that there are a bunch of backup files / unused Docker images that can be deleted. | Occasionally this resolves itself. Could we add an auto-delete schedule for backup files / Docker images, triggered by their age? Is this possible John Simmons (Deactivated) / Liban Hirey (Unlicensed)? | None. |
See #monitoring-prod Slack channel. | Service ran out of memory. | Is it a Saturday? This message has appeared several Saturdays in a row. Note: there’s an In progress query in Teams (from Mon 14 Dec) that appears more of an internal HEE issue to resolve, but outside TIS team (data leads). | None. |
| See #monitoring Slack channel. | Someone updating a Person record while this job is running. | This kind of failure sends an @channel notification. You can re-run the jobs from: | None. |
| HEE colleagues raise issues they’re having. Often their colleagues will corroborate, or resolve the problem, within MS Teams. For anything else: | Data quality errors inherited from Intrepid. | Which environment are they seeing the problem on? See the bulk/generic upload issue above. Check with Alistair / Seb and raise it in the #fire-fire Slack channel, stating whether or not you have been able to replicate it. Create a new #fire-fire-yyyy-mm-dd channel if confirmed. | Depends on the issue. |
| See #monitoring-prod Slack channel. Happens nightly. | “PersonPlacementTrainingBodyTrustJob” fails to start for the NIMDTA Sync service, which then stops the rest of the sync jobs from starting. | Manually restart the NIMDTA Sync jobs on https://nimdta.tis.nhs.uk/sync/ | None. |
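For the generic-upload row above (“Check Docker logs”), a sketch of filtering a container’s recent logs for the common failure signatures (old template / mis-entered dates). The container name and the exact log wording are assumptions; substitute the real service name:

```shell
# Surface likely upload failures from a log stream.
# Pass the log-producing command as arguments, e.g.:
#   check_upload_errors docker logs --since 24h generic-upload
# ("generic-upload" is an assumed container name; the error phrases
# are illustrative patterns, not confirmed log output).
check_upload_errors() {
  "$@" 2>&1 | grep -iE 'error|exception|invalid (date|template)'
}
```

Taking the command as arguments means the same filter works against `docker logs`, `kubectl logs`, or a plain file via `cat`, without editing the function.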
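For the ESR character-set row above, the encoding mismatch can be reproduced and fixed at the command line with `iconv`. Windows-1252 is used here as an assumed “extended ASCII” variant; the actual encoding should be confirmed against ESR’s 000098 spec before changing any service:

```shell
# Convert an ESR extract from an extended-ASCII encoding to UTF-8.
# WINDOWS-1252 is an assumption; confirm against the 000098 spec.
to_utf8() {
  iconv -f WINDOWS-1252 -t UTF-8
}

# \351 (octal) is byte 0xE9, "é" in Windows-1252; read naively as
# UTF-8 that byte is invalid, which is the class of failure seen.
printf 'Jos\351\n' | to_utf8
```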
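For the disk-space row above, the suggested age-based auto-delete could be a simple scheduled job along these lines. The path and 30-day retention are illustrative assumptions; unused Docker images can be pruned similarly with `docker image prune --filter until=720h`:

```shell
# Delete backup files older than a retention period (sketch of the
# auto-delete idea in the table above; path/retention are assumptions).
prune_old_backups() {
  dir="$1"; days="$2"
  find "$dir" -type f -mtime +"$days" -print -delete
}

# Example cron entry (illustrative):
# 0 3 * * * /usr/local/bin/prune_old_backups /var/backups/tis 30
```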
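The stop/start-versus-restart advice in the service-unavailable row above can be scripted. The `az` commands assume the VMs are hosted on Azure (check “What runs where?” for the actual host); the resource-group and VM names are placeholders, and the `DRY_RUN` guard just prints the commands so the sketch can be exercised safely:

```shell
# Stop/start (deallocate then start) rather than restart, so the service
# comes back on a fresh VM instead of in situ on a possibly-broken one.
# Assumes the Azure CLI; names below are placeholders.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

vm_stop_start() {
  rg="$1"; vm="$2"
  run az vm deallocate --resource-group "$rg" --name "$vm"
  run az vm start --resource-group "$rg" --name "$vm"
}

DRY_RUN=1 vm_stop_start tis-prod-rg tis-app-vm
```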
Emergency procedures in case of PRODUCTION capitulation.