Incident Log
Reference process for responding to incidents (please update as appropriate / required):
Defect management is critical to monitoring product quality throughout the product’s lifecycle. When dealing with a live defect of any kind, the process below should be followed in sequence so that issues are dealt with thoroughly and the outcome is as effective as possible. The process relies on the active participation of everyone working on the defect, and on each person understanding their role at every stage of the live defect process.
When anything appears to have fallen over, we need a clear, efficient approach to providing information back to the POs / originators of the incident. We especially need this while a skeleton crew is supporting the system over Christmas.
Sketch of a process for responding to issues:
/fire-fire
Sending this message will create a dedicated #fire-fire-[yyyy-mm-dd] channel in Slack and notify the team. A LiveDefect Jira ticket and a Confluence incident log page should also be created. (Note: we’re looking at automating the creation of all three with a single action in future - watch this space; a hedged sketch of what that might look like follows below.) To keep the team focused on the defect itself, the DM can nominate himself/herself to create the Jira ticket and the incident log page, but where the DM and PO are both unavailable, the team is self-organising and expected to agree amongst themselves who will do those two tasks. (For the uninitiated: we had a chicken hat in the office. Anyone responsible for a problem on Prod ceremonially wore the chicken hat as a badge of ‘honour’ for the day, got pride of place on the #wall-of-failure, and was thanked by everyone else for ‘taking one for the team’ - we all learn best from our mistakes!)
We need to respond fast when the live system goes down. At the same time, we must not panic and make mistakes that compound the issue. Finally, we need an approach to addressing the underlying problem so it does not happen again.
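The “one action” automation mentioned above does not exist yet. The sketch below is a hedged illustration of how a single call could create all three artefacts via the Slack, Jira and Confluence REST APIs. The API tokens, the Jira project key, the Confluence space key and the use of a ‘LiveDefect’ issue type name are all assumptions to check against our actual configuration; treat this as a starting point, not the implementation.

```python
"""Hypothetical one-shot incident bootstrap: Slack channel + Jira ticket + Confluence page.

Assumptions (verify before use): API tokens in environment variables, Jira project
key 'TIS', issue type 'LiveDefect', Confluence space key 'TIS'.
"""
import os
from datetime import date

import requests

SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]          # assumed bot token with channel-creation scope
ATLASSIAN_AUTH = (os.environ["ATLASSIAN_USER"],      # assumed email + API token pair
                  os.environ["ATLASSIAN_API_TOKEN"])
JIRA_BASE = "https://hee-tis.atlassian.net"


def open_incident(summary: str) -> dict:
    today = date.today().isoformat()

    # 1. Dedicated Slack channel, e.g. #fire-fire-2020-12-14
    slack = requests.post(
        "https://slack.com/api/conversations.create",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        json={"name": f"fire-fire-{today}"},
    ).json()

    # 2. LiveDefect Jira ticket (project key and issue type name are assumptions)
    jira = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=ATLASSIAN_AUTH,
        json={"fields": {
            "project": {"key": "TIS"},
            "issuetype": {"name": "LiveDefect"},
            "summary": summary,
        }},
    ).json()

    # 3. Confluence incident log page (space key is an assumption)
    page = requests.post(
        f"{JIRA_BASE}/wiki/rest/api/content",
        auth=ATLASSIAN_AUTH,
        json={
            "type": "page",
            "title": f"Incident log {today}: {summary}",
            "space": {"key": "TIS"},
            "body": {"storage": {"value": "<p>Timeline, impact, actions.</p>",
                                 "representation": "storage"}},
        },
    ).json()

    return {"slack": slack, "jira": jira, "confluence": page}
```

Wiring a Slack slash command up to a small service exposing something like open_incident() would give the single action we are after; whether that fits our infrastructure is a team decision.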
General guide to where services are hosted: What runs where?
Incident | Probable causes | Actions | Risks |
---|---|---|---|
User unable to log in. | The user is trying to access the system while a job is running on the server, the server is being restarted, or a build is taking place. | Is it a Trust user (normally clear from the message alerting us to the incident in the first place), or an HEE user? When ‘triaging’ issues, Alistair and Seb will often ask incident reporters to clear their cache / reset their password etc. as part of a series of standard actions that have historically sometimes fixed the problem (rather than distracting devs unnecessarily). | Few - users may end up with greater / lesser access than they need if you adjust their permissions. |
User creation. | HEE Admins usually resolve. Sometimes there’s a permission issue. | Essentially, point them to their LO Admin. Refer to the main Confluence page for this: Admin User Management (roles and permissions) to check whether permissions appear to have been set up as expected. | None. |
Generic upload. | Common errors have mainly been around using an old template or mis-entered dates. | Check the Docker logs. Incrementally add validation rules and warnings (on the app / on the templates?) to trap these types of error - see the validation sketch after this table. | None. |
No confirmation that the NDW Stage or Prod ETL processed successfully. | Another job failing / restarting and overlapping the schedule of the NDW ETLs. | Check whether another service failed / restarted and interrupted one of the NDW ETLs, and re-run the ETL if needed. Once it has completed, inform the NDW team on the #tis-ndw-etl Slack channel so they can initiate their downstream processes. Associated note: any Dev materially changing Stage needs to be aware this is likely to materially change the ETLs between TIS and NDW. Reflecting any changes in Stage within the ETLs is therefore now part of the definition of done for any ticket that materially changes Stage. | None - re-running the ETLs will simply refresh data. There is no issue with doing that ad hoc outside the scheduled jobs. |
See #monitoring-esr and #monitoring-prod Slack channels. | Usually ESR sending data in an extended ASCII character set while we read it as UTF-8. | Look at the Confluence page / team sharing Phil James did on handling this type of incident before he left (see also the decoding sketch after this table). Switch our services to use the extended ASCII set, if ESR can confirm this in their 000098 spec. | @Joseph (Pepe) Kelly / @Ashley Ransoo ? |
See #monitoring-prod Slack channel. | Temporary warning while a scheduled job is rendering the service unavailable. | Usually resolves itself. No action needed, unless the warning repeats or is not accompanied by a ‘Resolved’ pair message. Then open up the details in the Slack notification, check the logs for what’s broken and restart the service if needed. Consider stopping and starting the VM rather than restarting it (a stop and start will encourage the service to come back up on a new VM, whereas a restart restarts it in situ on the current VM - if the current VM is the problem, you’re no better off). Failing that, call for Ops help! | None. |
See #monitoring-prod Slack channel. | Overnight sync jobs require heavy processing. | Normally comes with a ‘Resolved’ pair message very shortly afterwards. No action needed, unless you don’t see the ‘Resolved’ pair. Then check Rabbit to see whether things are running properly - see the queue-check sketch after this table. | @Joseph (Pepe) Kelly ? |
See #monitoring-prod Slack channel. | Usually an indication that there are a bunch of backup files / unused Docker images that can be deleted. | Occasionally this resolves itself. Could we have an auto-delete schedule for backup files / Docker images, triggered by their age (see the cleanup sketch after this table)? Is this possible @John Simmons (Deactivated) / @Liban Hirey (Unlicensed)? | None. |
See #monitoring-prod Slack channel. | Service ran out of memory. | Is it a Saturday? This message has appeared several Saturdays in a row. Note: there’s an in-progress query in Teams (from Mon 14 Dec) that appears to be an internal HEE issue to resolve, sitting outside the TIS team (data leads). | None. |
See #monitoring Slack channel. | Someone updating a person record while this job is running. | This kind of failure sends a @channel notification. You can run them from: | None. |
HEE colleagues raise issues they're having. Often their colleagues will corroborate, or resolve the problem, within MS Teams. Anything else comes to us. | Data quality errors inherited from Intrepid. | Which environment are they having the problem on? See the bulk/generic upload issue above. Check with Alistair / Seb and raise it in the #fire_fire Slack channel, stating whether or not you have been able to replicate. Create a new #fire-fire-yyyy-mm-dd channel if confirmed. | Depends on the issue. |
See #monitoring-prod Slack channel. “PersonPlacementTrainingBodyTrustJob” fails to start for the NIMDTA Sync service, which then stops the rest of the sync jobs from starting. Happens nightly. | | Manually restart the NIMDTA Sync jobs on https://nimdta.tis.nhs.uk/sync/ | None. |
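The generic upload row above suggests incrementally adding validation rules to trap template and date errors. As a hedged illustration only (the column names and the dd/mm/yyyy format are assumptions, not taken from the real templates), a pre-upload check might look like this:

```python
"""Hypothetical pre-upload validation for one row of a generic upload spreadsheet.

The column names ('Date From', 'Date To') and the dd/mm/yyyy format are illustrative
assumptions - check the real template before relying on this.
"""
from datetime import datetime

DATE_COLUMNS = ("Date From", "Date To")   # assumed column names
DATE_FORMAT = "%d/%m/%Y"                  # assumed accepted format


def validate_row(row: dict, row_number: int) -> list[str]:
    """Return a list of human-readable warnings for one spreadsheet row."""
    warnings = []
    for column in DATE_COLUMNS:
        value = (row.get(column) or "").strip()
        if not value:
            warnings.append(f"Row {row_number}: '{column}' is empty.")
            continue
        try:
            datetime.strptime(value, DATE_FORMAT)
        except ValueError:
            warnings.append(
                f"Row {row_number}: '{column}' value '{value}' is not in dd/mm/yyyy format."
            )
    return warnings


# Example: surface problems to the uploader instead of leaving them in the Docker logs.
print(validate_row({"Date From": "2020-13-01", "Date To": "01/01/2021"}, row_number=2))
```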
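For the ESR encoding row (extended ASCII vs UTF-8), the decoding sketch below shows one tolerant way to read a file that is not valid UTF-8. Treating the extended ASCII set as Windows-1252 is an assumption; the actual encoding should be confirmed against ESR’s 000098 spec before changing any service configuration.

```python
"""Hypothetical tolerant read of an ESR file that may not be valid UTF-8.

Assumes the 'extended ASCII' character set is Windows-1252 (cp1252); confirm against
the 000098 spec before acting on this.
"""


def read_esr_file(path: str) -> str:
    raw = open(path, "rb").read()
    try:
        return raw.decode("utf-8")        # happy path: the file really is UTF-8
    except UnicodeDecodeError:
        # Fallback: decode as Windows-1252, which covers the common extended ASCII
        # characters (e.g. accented names) that break a strict UTF-8 read.
        return raw.decode("cp1252")
```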
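Where a monitoring warning for the overnight sync jobs does not get its ‘Resolved’ pair, the row above says to check Rabbit. The queue-check sketch below uses the RabbitMQ management HTTP API; the host, credentials and backlog threshold are placeholders, not real values from our environment.

```python
"""Hypothetical check of RabbitMQ queue depths via the management HTTP API.

Host, credentials and the backlog threshold are placeholders - substitute the real
monitoring endpoint and agree a sensible threshold with the team.
"""
import requests

RABBIT_API = "http://rabbitmq.example.internal:15672/api/queues"  # placeholder host
AUTH = ("monitor", "change-me")                                   # placeholder credentials
BACKLOG_THRESHOLD = 1000                                          # assumed 'worth a look' level


def queues_with_backlog() -> list[tuple[str, int]]:
    """Return (queue name, ready message count) for queues above the threshold."""
    queues = requests.get(RABBIT_API, auth=AUTH, timeout=10).json()
    return [
        (q["name"], q.get("messages_ready", 0))
        for q in queues
        if q.get("messages_ready", 0) > BACKLOG_THRESHOLD
    ]


if __name__ == "__main__":
    for name, ready in queues_with_backlog():
        print(f"{name}: {ready} messages waiting - sync may not be draining the queue")
```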
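On the disk-space row asking whether an age-based auto-delete is possible: the cleanup sketch below is one hedged answer, suitable for running from cron on the affected host. The backup directory, the retention periods and the choice to prune only unused Docker images older than 7 days are all assumptions to agree with Ops.

```python
"""Hypothetical age-based cleanup of backup files and unused Docker images.

The backup path and retention periods are assumptions; intended for a cron job on the
affected host once Ops are happy with it.
"""
import subprocess
import time
from pathlib import Path

BACKUP_DIR = Path("/var/backups/tis")   # assumed backup location
MAX_AGE_DAYS = 14                       # assumed retention for backup files


def delete_old_backups() -> None:
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 60 * 60
    for f in BACKUP_DIR.glob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            print(f"Deleting {f}")
            f.unlink()


def prune_old_docker_images() -> None:
    # 'docker image prune' with an age filter; 168h = 7 days. This removes only
    # unused images, so running containers are unaffected.
    subprocess.run(
        ["docker", "image", "prune", "--all", "--force", "--filter", "until=168h"],
        check=True,
    )


if __name__ == "__main__":
    delete_old_backups()
    prune_old_docker_images()
```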
Emergency procedures in case of PRODUCTION capitulation.
This page is intended to capture all issues encountered on the PRODUCTION environment, to aid lessons learned and feed into continuous development.
Slack: https://hee-nhs-tis.slack.com/
Jira issues: https://hee-tis.atlassian.net/issues/?filter=14213