Incident Log


Reference process for responding to incidents (please update as appropriate / required):

 

Defect management is critical for ongoing product quality monitoring throughout the product’s lifecycle. When dealing with a live defect of any kind, it is important that the process below is followed in sequence so that issues are dealt with thoroughly and the outcome is as effective as possible. Part of that process is the active participation of the team working on the defect, and everyone understanding their role at each stage of the live defect process.

When anything appears to have fallen over, we need a clear, efficient approach to getting information back to the POs / originators of the incident. We especially need this when a skeleton crew is supporting the system over Christmas.

Sketch of a process for responding to issues:

[Diagram: How to respond to a Live Defect in TIS]
How to respond to a Live Defect in TIS?
1. Notify the incident originator (usually via Teams, or via Naz / Rob) that you're looking into it. As first responder, take on, or suggest who should take on, the ‘coordinator’ role for the incident. The coordinator would not get sucked into fixing the problem (unless they’re the only person around when it’s noticed), but would focus on coordinating the effort and the comms between teammates working on the issue and to Naz, Rob, the incident reporter, users and beyond.
From the first report of the live defect, it is paramount that the DM/PM delegates the coordinator role so that all steps throughout the process are properly followed (what is the process?).
2. In Slack, send a /fire-fire message, which will create a dedicated #fire-fire-[yyyy-mm-dd] channel in Slack and notify the team. Also create a LiveDefect Jira ticket and a Confluence incident log page. (Note: we’re looking at automating the creation of these three things with one action in future - watch this space; a rough sketch of what that automation might look like follows this list.) To keep the team focused, the DM can nominate themselves to create the Jira ticket and the incident log page, but where the DM/PO are both unavailable, the team is self-organising and is expected to agree amongst themselves who will do these two tasks (ticket and incident log page).
If the incident is data related, make sure you involve someone on the data side of things - James, BAs, data analysts - so they can help assess the impact and run any reports to that effect asap, before the incident is fixed and retrospective reporting on impact becomes much more difficult.
3. Use the incident log page to start recording a timeline of events (first noticed, first alerted, initial actions taken, results, etc.). Sometimes in the middle of a #fire-fire the temptation is to quickly fix things without making others aware of what you're doing. When these quick fixes don't work, they can seriously muddy the waters of the #fire-fire issue.
4. Ask for assistance (who, specifically, do you need to help; or does this need an 'all hands to the pump' approach?)
5. Look at the logs (depends on the issue).
6. Determine the likely causes (see table below as a starter)
7. Nothing obvious? Who can, or needs to, help with the RCA? It is worth revisiting your description of the problem to make sure all the obvious questions have been asked, then returning to the RCA, ensuring you have the right expertise within the team for a successful outcome.
8. Who needs informing (within the team and beyond)? How do they need informing (e.g. Slack, email, MS Teams, by phone) and when (e.g. as soon as the problem is noticed, as soon as the team starts responding, whenever there's significant progress on a resolution, when the team needs assistance from outside the team, when the problem is confirmed as resolved)?
9. Once a quick fix is in place, do you need to ticket up any longer-term work to prevent the problem in future? (Bear in mind, whoever leads on addressing the incident is probably now best placed to determine the priority of such longer-term fix tickets.)
10. If anyone feels responsible for the incident, we are working on a virtual chicken hat to wear!
(for the uninitiated, we had a chicken hat in the office. Anyone responsible for a problem on Prod ceremonially wore the chicken hat as a badge of ‘honour’ for the day, got pride of place on the #wall-of-failure, and was thanked by everyone else for ‘taking one for the team’ - we all learn best from our mistakes!)
11. Active engagement of the PM/DM in supporting the coordinator role is essential to help the team manage the live defect process effectively, and to improve the process so any future live defect is managed better.
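
A rough sketch of the step 2 automation mentioned above, in Python against the public Slack, Jira and Confluence REST APIs. The tokens, the "TIS" project key, the "LiveDefect" issue type name and the page title are all assumptions to check against the real workspace; treat this as a starting point, not a finished integration.

    # Hypothetical "one action" for step 2: Slack channel + Jira ticket + Confluence page.
    import os
    from datetime import date
    import requests

    SLACK_TOKEN = os.environ["SLACK_TOKEN"]  # bot token with channel-creation scope
    ATLASSIAN_AUTH = (os.environ["ATLASSIAN_USER"], os.environ["ATLASSIAN_API_TOKEN"])
    ATLASSIAN_BASE = "https://hee-tis.atlassian.net"

    def open_incident(summary: str) -> None:
        today = date.today().isoformat()

        # 1. Dedicated #fire-fire-yyyy-mm-dd Slack channel
        requests.post(
            "https://slack.com/api/conversations.create",
            headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
            json={"name": f"fire-fire-{today}"},
        ).raise_for_status()

        # 2. LiveDefect Jira ticket (project key and issue type name are assumptions)
        requests.post(
            f"{ATLASSIAN_BASE}/rest/api/2/issue",
            auth=ATLASSIAN_AUTH,
            json={"fields": {
                "project": {"key": "TIS"},
                "summary": summary,
                "issuetype": {"name": "LiveDefect"},
            }},
        ).raise_for_status()

        # 3. Confluence incident log page in the NTCS space
        requests.post(
            f"{ATLASSIAN_BASE}/wiki/rest/api/content",
            auth=ATLASSIAN_AUTH,
            json={
                "type": "page",
                "title": f"Incident log {today}: {summary}",
                "space": {"key": "NTCS"},
                "body": {"storage": {"value": "<p>Timeline of events</p>",
                                     "representation": "storage"}},
            },
        ).raise_for_status()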

We need to respond fast to the live system going down. On the other hand, we must not panic and make mistakes that compound the issue. And finally, we need an approach to addressing the underlying problem to prevent it happening again.

General guide to where services are hosted: https://hee-tis.atlassian.net/wiki/spaces/NTCS/pages/3358621697

Known incident types are listed below; each entry gives the Incident, Probable causes, Actions and Risks.


Incident: User unable to log in.
No Slack notifications. Usually alerted by users in Teams. Rare.

Probable causes: The user is trying to access the system while a job is running on the server, a service is being restarted, or a build is taking place.

Actions:
  • Is it a Trust user (normally clear from the message alerting us to the incident in the first place), or an HEE user?
  • Is it TIS, TISSS, Reval or NI?
  • Check the User Management Docker logs - often a roles / permissions issue; sometimes an incorrect password was entered (see the log-checking sketch below this entry).
  • KeyCloak services may be down - check the KeyCloak Docker logs.
  • In User Management, add permissions if necessary (double-check with Alistair / Seb first, if possible). Reset the user's password if necessary.
  • Ask the user to try again.
Note: when ‘triaging’ issues, Alistair and Seb will often ask incident reporters to clear their cache, reset their password, etc. as part of a series of standard actions that have historically sometimes fixed the problem (rather than distracting devs unnecessarily). We shouldn’t, however, need to ask users to clear their cache or reset their passwords (unless they’ve locked themselves out - in which case we should be alerted about that problem specifically). If either of these actions does actually sort a problem out, then we also need to mitigate against it. Is a cache-busting ticket needed? Weigh the effort against the amount of time wasted on this kind of response and the number of users affected.

Risks: Few - users may have greater / lesser access than they need if you assign them more permissions.
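
The log-checking step above is the sort of thing that can be scripted. A minimal sketch, assuming the Docker SDK for Python is installed on the server and the containers are called "usermanagement" and "keycloak" (the names are assumptions - confirm with docker ps):

    # Pull the last 200 log lines from the auth-related containers to look for
    # role / permission or password errors. Container names are assumptions.
    import docker

    client = docker.from_env()
    for name in ("usermanagement", "keycloak"):
        container = client.containers.get(name)
        print(f"--- {name} (status: {container.status}) ---")
        print(container.logs(tail=200).decode("utf-8", errors="replace"))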

Incident: User creation.
No Slack notifications. We’d normally be approached directly for this (email / Teams). Rare.

Probable causes: HEE Admins usually resolve these. Sometimes there’s a permissions issue.

Actions: Essentially, point them to their LO Admin. Refer to the main Confluence page for this, Admin User Management (roles and permissions), to check whether permissions appear to have been set up as expected.

Risks: None.

Incident: Generic upload errors.
No notifications in any Slack channels. We’d normally be approached via Teams. Quite common.

Probable causes: Common errors have mainly been around using an old template or mis-entered dates, e.g. 1019-06-12 instead of 2019-06-12.

Actions:
  • Check the Docker logs.
  • Find the offending file.
  • Check the template for glaring manual errors.
  • Incrementally add validation rules and warnings (on the app / on the templates?) to trap these types of error - see the sketch below this entry.

Risks: None.
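
A minimal sketch of the kind of validation rule that would trap the mis-typed dates above before an upload is processed (the accepted 1900-2100 range is an assumption):

    # Flag dates that don't parse or fall outside a plausible range (e.g. 1019-06-12).
    from datetime import date
    from typing import Optional

    def check_upload_date(value: str) -> Optional[str]:
        """Return an error message for an implausible date, or None if it looks sane."""
        try:
            parsed = date.fromisoformat(value)
        except ValueError:
            return f"'{value}' is not a valid yyyy-mm-dd date"
        if not date(1900, 1, 1) <= parsed <= date(2100, 12, 31):
            return f"'{value}' is outside the plausible range 1900-2100"
        return None

    print(check_upload_date("1019-06-12"))  # flags the typo
    print(check_upload_date("2019-06-12"))  # None - passes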

Incident: No confirmation that the NDW Stage or Prod ETL processed successfully.
See the #monitoring-ndw channel. Very rare now.

Probable causes: Another job failing / restarting and overlapping the schedule of the NDW ETLs.

Actions:
  • Did another service fail / restart and interrupt one of the NDW ETLs?
  • Restart the ETL.
  • Once done, inform the NDW team on the #tis-ndw-etl Slack channel so they can initiate their downstream processes.
Associated note: any dev materially changing Stage needs to be aware this is likely to materially change the ETLs between TIS and NDW. Reflecting any changes to Stage within the ETLs is therefore part of the definition of done for any ticket that materially changes Stage.

Risks: None - restarting the ETLs will simply refresh the data; there are no issues with doing that ad hoc outside the scheduled jobs.

Incident: “AWS RabbitMQ Prod bad data received from ESR in CSV file”.
See the #monitoring-esr and #monitoring-prod Slack channels. Common.

Probable causes: Usually ESR sending data in the extended ASCII character set while we read it in UTF-8.

Actions:
  • Look at the Confluence page / team sharing Phil James did on handling this type of incident before he left.
  • Email david.mayall@nhs.net with the file name (or error message), asking ESR to re-submit the amended file or make a subsequent file as appropriate.
Longer-term mitigations:
  • Switch our services to use the extended ASCII set, if ESR can confirm this in their 000098 spec.
  • A Python script is being worked on (as of 2020-12-16: John Simmons) to automate(?) the fixing - see the sketch below this entry.
  • Java FileReader can use ISO-8859-1 or other common character sets.

Risks: @Joseph (Pepe) Kelly / @Ashley Ransoo ?
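
A minimal sketch of the kind of fix the Python script mentioned above might apply: re-read a file ESR has sent in extended ASCII (assumed here to be ISO-8859-1 / Latin-1) and write it back out as UTF-8 so our services can consume it. The file name is illustrative only:

    # Re-encode an ESR CSV from ISO-8859-1 to UTF-8, written alongside the original.
    from pathlib import Path

    def reencode(path: str) -> Path:
        source = Path(path)
        text = source.read_text(encoding="iso-8859-1")
        target = source.with_name(source.stem + "_utf8" + source.suffix)
        target.write_text(text, encoding="utf-8")
        return target

    reencode("esr_extract.csv")  # placeholder file name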

Incident: “AWS Monitor service non-operational”.
See the #monitoring-prod Slack channel. Common.

Probable causes: Temporary warning while a scheduled job renders the service unavailable.

Actions: Usually resolves itself. No action needed unless the warning repeats or is not accompanied by a ‘Resolved’ pair message. In that case, open up the details in the Slack notification, check the logs for what's broken and restart the service if needed. Consider stopping and starting the VM rather than restarting it: stop and start encourages it to come back up on a new VM, whereas a restart restarts it in situ on the current VM, and if the current VM is the problem you’re no better off (see the sketch below this entry). Failing that, call for Ops help!

Risks: None.
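
The stop-and-start suggestion above can be scripted with boto3, assuming the monitor service runs on an EC2 instance and AWS credentials are configured; the instance ID is a placeholder. Stop/start brings the instance back up on fresh underlying hardware, whereas a reboot restarts it in place on the same host:

    # Stop, wait, then start the instance so it comes back on new underlying hardware.
    import boto3

    ec2 = boto3.client("ec2")
    instance_id = "i-0123456789abcdef0"  # placeholder - look up the real monitor VM

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])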

Incident: “AWS RabbitMQ Prod High messages waiting”.
See the #monitoring-prod Slack channel. Very common.

Probable causes: Overnight sync jobs require heavy processing:
  • PersonPlacementEmployingBodyTrustJob;
  • Person sync job.
Normally comes with a ‘Resolved’ pair message very shortly afterwards.

Actions: No action needed, unless you don’t see the ‘Resolved’ pair. Then check Rabbit to see whether things are running properly (see the sketch below this entry).

Risks: @Joseph (Pepe) Kelly ?
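
A minimal sketch of "check Rabbit", assuming the RabbitMQ management API is reachable; the host, port and credentials are placeholders. It lists every queue that still has messages waiting:

    # Query the RabbitMQ management API for queues with messages still waiting.
    import requests

    resp = requests.get(
        "https://rabbitmq-prod.example.internal:15672/api/queues",
        auth=("monitoring-user", "password"),  # placeholder credentials
    )
    resp.raise_for_status()
    for queue in resp.json():
        waiting = queue.get("messages_ready", 0)
        if waiting:
            print(f"{queue['vhost']}/{queue['name']}: {waiting} messages waiting")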

Incident: “AWS Server storage is almost full“.
See the #monitoring-prod Slack channel. Not rare.

Probable causes: Usually an indication that there are a bunch of backup files / unused Docker images that can be deleted.

Actions: Occasionally this resolves itself. If not, call in the cavalry: an Ops person needs to clear out some space!
Longer term: would an auto-delete schedule on backup files / Docker images, triggered by their age, help? Is this possible @John Simmons (Unlicensed) / @Liban Hirey (Unlicensed)? (See the sketch below this entry.)

Risks: None.
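
One possible answer to the auto-delete question above, sketched in Python: delete backup files older than a retention window and prune unused Docker images of the same age. The backup directory, file pattern and 30-day retention are assumptions, and this would need an Ops review and a schedule (e.g. cron) before use:

    # Remove old backups and prune unused Docker images older than the retention window.
    import subprocess
    import time
    from pathlib import Path

    RETENTION_DAYS = 30
    cutoff = time.time() - RETENTION_DAYS * 24 * 3600

    for backup in Path("/var/backups/tis").glob("*.gz"):  # assumed location and pattern
        if backup.stat().st_mtime < cutoff:
            backup.unlink()

    subprocess.run(
        ["docker", "image", "prune", "--all", "--force",
         f"--filter=until={RETENTION_DAYS * 24}h"],
        check=True,
    )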

Incident: “AWS GMC Connect down“.
See the #monitoring-prod Slack channel. Often also alerted in Teams. Not rare.

Probable causes: The service ran out of memory.

Actions:
  • Is it a Saturday? This message has appeared several Saturdays in a row.
  • Check whether HEE’s Reval service has crashed / run out of memory.
  • Normally resolves itself.
  • Warn users on Teams - with a clear indication of when things will be back in sync.
  • Restart the Reval service. Maybe clear some memory first.
Note: there’s an in-progress query in Teams (from Mon 14 Dec) that appears to be more of an internal HEE issue to resolve, outside the TIS team (data leads).

Risks: None.

Incident: “Person Owner Rebuild Job failed to run / complete on either blue or green servers”.
See the #monitoring Slack channel. Rare.

Probable causes: Someone updating a person record while this job is running.

Actions:
  • This kind of failure sends an @channel notification.
  • Check whether the person filter is not working.
  • Re-run the job. A list of what other jobs need to be run can be found on GitHub under "Summary of what to run".
  • You can run them from:

Risks: None.

Incident: HEE colleagues raise issues they're having. Often their colleagues will corroborate, or resolve, the problem within MS Teams. For anything else, see below.

Probable causes:
  • Data quality errors inherited from Intrepid.
  • Data quality errors introduced by Bulk Upload (where there is insufficient validation to catch the problem).
  • A genuine bug the TIS team have introduced.

Actions:
  • Which environment are they having the problem on?
  • What permissions do they have?
  • Is there any user error at play?
  • See the generic upload entry above.
  • Try to get as much information as possible from the person raising the issue (without making them any more frustrated than they are!).
  • Verify the problem yourself on Stage if feasible.
  • Check with Alistair / Seb and raise it in the #fire_fire Slack channel, stating whether or not you have been able to replicate it. Create a new #fire-fire-yyyy-mm-dd channel if confirmed.

Risks: Depends on the issue.

Incident: “PersonPlacementTrainingBodyTrustJob” fails to start for the NIMDTA Sync service, which then stops the rest of the sync jobs from starting.
See the #monitoring-prod Slack channel. Happens nightly.

Probable causes: (none recorded)

Actions: Manually restart the NIMDTA Sync jobs on https://nimdta.tis.nhs.uk/sync/

Risks: None.

Emergency procedures in case of PRODUCTION capitulation.

This page is intended to capture all issues encountered on the PRODUCTION environment, to aid lessons learned and feed into continuous development.