Reference process for responding to incidents (please update as appropriate / required):

Defect management is critical for ongoing product quality monitoring throughout the product's lifecycle. When dealing with a live defect of any kind, it is important that the process is followed in sequence so that issues are dealt with thoroughly and the outcome is as effective as possible. The process relies on the active participation of the team working on the defect, and on everyone understanding their role at every stage of the live defect process.

When anything appears to have fallen over, we need a clear, efficient approach to getting information back to the POs / originators of the incident. We especially need this while a skeleton crew is supporting the system over Christmas.

Sketch of a process for responding to issues:

(Attached diagram: How to respond to a Live Defect in TIS.png)

We need to respond fast when the live system goes down. At the same time, we must not panic and make mistakes that compound the issue. Finally, we need an approach to addressing the underlying problem so that it does not happen again.

General guide to where services are hosted: What runs where?

Each incident below is recorded against four headings: Incident, Probable causes, Actions and Risks.

Incident: User unable to log in.
No Slack notifications; we are usually alerted by users in Teams.
Rare.

Probable causes: A user trying to access the system while a job is running on the server, while the server is being restarted, or while a build is taking place.

Actions:
Is it a Trust user (normally clear from the message alerting us to the incident in the first place), or an HEE user?
Is it TIS, TISSS, Reval or NI?
Check the User Management Docker logs - it is often a roles / permissions issue; sometimes an incorrect password has been entered.
The KeyCloak services may be down - check the KeyCloak Docker logs (see the log-check sketch below).
In User Management, add permissions if necessary (double check with Alistair / Seb first, if possible). Reset the user's password if necessary.
Ask the user to try again.
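A minimal sketch of that log check (in Python, shelling out to docker logs) is below. The container names and the idea of searching for the affected user's email address are assumptions - check docker ps on the box for the real names.

```python
# Minimal sketch: pull recent lines from the User Management and KeyCloak
# containers and print any that mention the affected user.
# Container names below are assumptions - check `docker ps` for the real ones.
import subprocess
import sys

CONTAINERS = ["usermanagement", "keycloak"]  # assumed container names


def search_logs(user_identifier: str, tail: int = 500) -> None:
    for container in CONTAINERS:
        result = subprocess.run(
            ["docker", "logs", "--tail", str(tail), container],
            capture_output=True, text=True,
        )
        # Apps write to stdout or stderr, so search both streams.
        combined = result.stdout + result.stderr
        matches = [line for line in combined.splitlines()
                   if user_identifier.lower() in line.lower()]
        print(f"--- {container}: {len(matches)} matching line(s) ---")
        for line in matches:
            print(line)


if __name__ == "__main__":
    search_logs(sys.argv[1] if len(sys.argv) > 1 else "user@example.com")
```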

When 'triaging' issues, Alistair and Seb will often ask incident reporters to clear their cache / reset their password etc. as part of a series of standard actions that have historically sometimes fixed the problem (rather than distracting devs unnecessarily).
We shouldn't, however, need to ask users to clear their cache or reset their passwords (unless they've locked themselves out - in which case we should be alerted about that problem specifically).
If either of these actions does actually sort a problem out, then we also need to mitigate against it.
Is a cache-busting ticket needed? Weigh the effort against the amount of time wasted on this kind of response and the number of users affected.

Risks: Few - users may end up with greater or lesser access than they need if you assign them more permissions.

Incident: User creation.
No Slack notifications; we'd normally be approached directly for this (email / Teams).
Rare.

Probable causes: HEE Admins usually resolve these themselves. Sometimes there's a permission issue.

Actions:
Essentially, point them to their LO Admin.
Refer to the main Confluence page for this, Admin User Management (roles and permissions), to check whether permissions appear to have been set up as expected.

Risks: None.

Incident: Generic upload.
No notifications in any Slack channels; we'd normally be approached via Teams.
Quite common.

Probable causes: Common errors have mainly been around using an old template or mis-entered dates, e.g.
  • 1019-06-12, vs
  • 2019-06-12.

Actions:
Check the Docker logs.
Find the offending file.
Check the template for glaring manual errors.

Incrementally add validation rules and warnings (in the app / on the templates?) to trap these types of error - see the date-validation sketch below.
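A minimal sketch of the kind of validation rule that would trap these errors is below. It flags dates that do not parse, or whose year is implausible (catching 1019-06-12 vs 2019-06-12). The column names and the accepted year range are assumptions, not necessarily what the app or templates actually use.

```python
# Minimal sketch of a date-plausibility check over an uploaded CSV.
# DATE_COLUMNS and the accepted year range are assumptions.
import csv
from datetime import date, datetime

DATE_COLUMNS = ["dateFrom", "dateTo"]               # assumed column names
MIN_YEAR, MAX_YEAR = 1990, date.today().year + 10   # assumed plausible range


def validate_upload(path: str) -> list:
    warnings = []
    with open(path, newline="", encoding="utf-8") as f:
        # Row 1 is the header, so data rows start at 2.
        for row_number, row in enumerate(csv.DictReader(f), start=2):
            for column in DATE_COLUMNS:
                value = (row.get(column) or "").strip()
                if not value:
                    continue
                try:
                    parsed = datetime.strptime(value, "%Y-%m-%d").date()
                except ValueError:
                    warnings.append(f"Row {row_number}: {column} '{value}' is not a valid date")
                    continue
                if not MIN_YEAR <= parsed.year <= MAX_YEAR:
                    warnings.append(f"Row {row_number}: {column} '{value}' has an implausible year")
    return warnings


# for warning in validate_upload("generic_upload.csv"):  # placeholder file name
#     print(warning)
```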

Risks: None.

Incident: No confirmation that the NDW Stage or Prod ETL processed successfully.
See the #monitoring-ndw channel.
Very rare now.

Probable causes: Another job failing / restarting and overlapping the schedule of the NDW ETLs.

Actions:
Did another service fail / restart and interrupt one of the NDW ETLs?
Restart the ETL.
Once done, inform the NDW team on the #tis-ndw-etl Slack channel so they can initiate their downstream processes.

Associated note: any dev materially changing Stage needs to be aware that this is likely to materially change the ETLs between TIS and NDW. Reflecting any changes to Stage in the ETLs is therefore part of the definition of done for any ticket that materially changes Stage.

Risks: None - restarting the ETLs will simply refresh the data. There are no issues with doing that ad hoc outside the scheduled jobs.

Incident: See the #monitoring-esr and #monitoring-prod Slack channels.
“AWS RabbitMQ Prod bad data received from ESR in CSV file”.
Common.

Probable causes: Usually ESR sending data in the extended ASCII character set while we read it as UTF-8.

Actions:
Look at the Confluence page / team sharing session Phil James did on handling this type of incident before he left.
Email david.mayall@nhs.net with the file name (or error message), asking them to re-submit the amended file or produce a subsequent file as appropriate.

Longer term: switch our services to use the extended ASCII set, if ESR can confirm this in their 000098 spec.
A Python script is being worked on (as of 2020-12-16: John Simmons) to automate(?) the fixing - see the re-encoding sketch below.
Java's FileReader can use ISO-8859-1 or other common character sets.
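A minimal sketch of the re-encoding fix is below. It assumes the incoming file is ISO-8859-1 (confirm the actual character set against ESR's 000098 spec) and rewrites it as UTF-8; the file paths in the usage line are placeholders.

```python
# Minimal sketch: check whether an ESR CSV decodes as UTF-8 and, if not,
# re-encode it from extended ASCII. ISO-8859-1 is an assumption - confirm
# the real character set against the 000098 spec.
from pathlib import Path


def reencode(source: str, destination: str, source_encoding: str = "ISO-8859-1") -> None:
    raw = Path(source).read_bytes()
    try:
        raw.decode("utf-8")
        print("File already decodes as UTF-8; no fix needed")
        return
    except UnicodeDecodeError as err:
        print(f"Not valid UTF-8 (first bad byte at offset {err.start}); re-encoding")
    text = raw.decode(source_encoding)
    Path(destination).write_text(text, encoding="utf-8")


# reencode("esr_file.csv", "esr_file_utf8.csv")  # placeholder file names
```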

Joseph (Pepe) Kelly / Ashley Ransoo ?

Incident: See the #monitoring-prod Slack channel.
“AWS Monitor service non-operational”.
Common.

Probable causes: A temporary warning - a scheduled job is usually rendering the service unavailable.

Actions:
Usually resolves itself.
No action needed unless the warning repeats, or is not accompanied by a paired ‘Resolved’ message. In that case, open up the details in the Slack notification, check the logs for what's broken and restart the service if needed. Consider stopping and starting the VM rather than restarting it: a stop and start encourages it to come back up on a new VM, whereas a restart restarts it in situ on the current VM - if the current VM is the problem, you're no better off (see the sketch below). Failing that, call for Ops help!
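A minimal sketch of the stop-and-start approach is below, assuming the affected VM is an EC2 instance and AWS credentials are already configured; the instance ID and region are placeholders.

```python
# Minimal sketch: stop and then start an EC2 instance rather than rebooting it,
# so AWS can bring it back up on new underlying hardware.
# The instance ID and region are placeholders.
import boto3


def stop_and_start(instance_id: str, region: str = "eu-west-2") -> None:
    ec2 = boto3.client("ec2", region_name=region)

    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print(f"{instance_id} has been stopped and started")


# stop_and_start("i-0123456789abcdef0")  # placeholder instance ID
```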

Risks: None.

Incident: See the #monitoring-prod Slack channel.
“AWS RabbitMQ Prod High messages waiting”.
Very common.

Probable causes: Overnight sync jobs require heavy processing:
  • PersonPlacementEmployingBodyTrustJob;
  • Person sync job.

Actions:
Normally comes with a paired ‘Resolved’ message very shortly afterwards.
No action needed unless you don't see the ‘Resolved’ message. Then check RabbitMQ to see whether things are running properly - see the queue-check sketch below.
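A minimal sketch of checking queue depths via the RabbitMQ management HTTP API is below. The host, credentials and backlog threshold are assumptions.

```python
# Minimal sketch: list queues whose backlog exceeds a threshold, using the
# RabbitMQ management HTTP API. Host, credentials and threshold are assumptions.
import requests


def queues_with_backlog(base_url: str, user: str, password: str, threshold: int = 10_000):
    response = requests.get(f"{base_url}/api/queues", auth=(user, password), timeout=30)
    response.raise_for_status()
    backlog = []
    for queue in response.json():
        messages = queue.get("messages", 0)
        if messages > threshold:
            backlog.append((queue["vhost"], queue["name"], messages))
    return backlog


# for vhost, name, count in queues_with_backlog("https://rabbitmq.example.com", "user", "pass"):
#     print(f"{vhost}/{name}: {count} messages waiting")
```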

Joseph (Pepe) Kelly ?

Incident: See the #monitoring-prod Slack channel.
“AWS Server storage is almost full”.
Not rare.

Probable causes: Usually an indication that there are a bunch of backup files / unused Docker images that can be deleted.

Actions:
Occasionally this resolves itself.
If not, call in the cavalry: an Ops person needs to clear out some space!

Possible mitigation: an auto-delete schedule for backup files / Docker images, triggered by their age? Is this possible, John Simmons (Deactivated) / Liban Hirey (Unlicensed)? See the clean-up sketch below.
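A minimal sketch of that age-based clean-up for backup files is below. The backup directory, file pattern and retention period are assumptions; unused Docker images can be handled separately (e.g. docker image prune).

```python
# Minimal sketch: list (or delete) backup files older than a cut-off.
# Directory, file pattern and retention period are assumptions - dry-run first.
import time
from pathlib import Path


def stale_backups(directory: str, pattern: str = "*.gz",
                  max_age_days: int = 30, delete: bool = False) -> list:
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    stale = [p for p in Path(directory).glob(pattern) if p.stat().st_mtime < cutoff]
    for path in stale:
        print(f"{'Deleting' if delete else 'Would delete'} {path}")
        if delete:
            path.unlink()
    return stale


# stale_backups("/var/backups/tis", delete=False)  # dry run; path is a placeholder
```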

Risks: None.

Incident: See the #monitoring-prod Slack channel.
“AWS GMC Connect down”.
Often also alerted in Teams.
Not rare.

Probable causes: The service ran out of memory.

Actions:
Is it a Saturday? This message has appeared several Saturdays in a row.
It is usually HEE's Reval service crashing after running out of memory.
Normally resolves itself.
Warn users on Teams - with a clear indication of when things will be back in sync.
Restart the Reval service. Maybe clear some memory first.

Note: there's an in-progress query in Teams (from Mon 14 Dec) that appears to be more of an internal HEE issue to resolve, outside the TIS team (data leads).

Risks: None.

Incident: See the #monitoring Slack channel.
“Person Owner Rebuild Job failed to run / complete on either blue or green servers”.
Rare.

Probable causes: Someone updating a person record while this job is running.

Actions:
This kind of failure sends an @channel notification.
The person filter will not be working.
Re-run the job. A list of the other jobs that need to be run can be found on GitHub: "Summary of what to run".

You can run them from:

Risks: None.

Incident: HEE colleagues raise issues they're having. Often their colleagues will corroborate, or resolve, their problem within MS Teams.
For anything else:

Probable causes:
Data quality errors inherited from Intrepid.
Data quality errors introduced by Bulk Upload (where there is insufficient validation to catch the problem).
A genuine bug the TIS team have introduced.

Actions:
Which environment are they having the problem on?
What permissions do they have?
Is there any user error at play?

See the bulk / generic upload issue above.
Try to get as much information from the person raising the issue as possible (without making them any more frustrated than they already are!).
Verify the problem yourself on Stage if feasible.

Check with Alistair / Seb and raise it in the #fire_fire Slack channel, stating whether or not you have been able to replicate the issue. Create a new #fire-fire-yyyy-mm-dd channel if it is confirmed.

Risks: Depends on the issue.

Incident: See the #monitoring-prod Slack channel.
“PersonPlacementTrainingBodyTrustJob” fails to start for the NIMDTA Sync service, which then stops the rest of the sync jobs from starting.
Happens nightly.

Actions: Manually restart the NIMDTA sync jobs at https://nimdta.tis.nhs.uk/sync/

Risks: None.

Emergency procedures in case of PRODUCTION capitulation.

This page is intended to capture all issues encountered on the PRODUCTION environment, to aid lessons learned and feed into continuous development.