Continual Service Improvement: Problem detection

Most organizations will have several ways of detecting
problems. These include:
■ Suspicion or detection of an unknown cause of one or
more incidents by the Service Desk, resulting in a
Problem Record being raised – the desk may have
resolved the incident but has not determined a
definitive cause and suspects that it is likely to recur,
so will raise a Problem Record to allow the underlying
cause to be resolved. Alternatively, it may be
immediately obvious from the outset that an incident,
or incidents, has been caused by a major problem, so
a Problem Record will be raised without delay.
■ Analysis of an incident by a technical support group
which reveals that an underlying problem exists, or is
likely to exist.
■ Automated detection of an infrastructure or
application fault, using event/alert tools automatically
to raise an incident which may reveal the need for a
Problem Record.
■ A notification from a supplier or contractor that a
problem exists that has to be resolved.
■ Analysis of incidents as part of proactive Problem
Management – resulting in the need to raise a
Problem Record so that the underlying fault can be
investigated further.
Frequent and regular analysis of incident and problem
data must be performed to identify any trends as they
become discernible. This will require meaningful and
detailed categorization of incidents/problems and regular
reporting of patterns and areas of high occurrence. ‘Top
ten’ reporting, with drill-down capabilities to lower levels,
is useful in identifying trends.
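As an illustration of 'top ten' reporting with drill-down, the following is a minimal Python sketch. The incident data, the two-level category scheme and the function names `top_categories` and `drill_down` are assumptions made for the example; they are not taken from any particular Service Desk tool.

```python
from collections import Counter

# Hypothetical incident records as (category, subcategory) pairs.
incidents = [
    ("Network", "Router"),
    ("Network", "Switch"),
    ("Network", "Router"),
    ("Email", "Mailbox full"),
    ("Desktop", "Printer"),
    ("Network", "Router"),
    ("Email", "Spam filter"),
]


def top_categories(records, n=10):
    """Return the n most frequent top-level categories with their counts."""
    return Counter(category for category, _ in records).most_common(n)


def drill_down(records, category, n=10):
    """Return the n most frequent subcategories within one category."""
    counts = Counter(sub for cat, sub in records if cat == category)
    return counts.most_common(n)


top = top_categories(incidents)
network_detail = drill_down(incidents, "Network")

for category, count in top:
    print(f"{category}: {count}")
```

Running the same report at regular intervals and comparing the results is one simple way to see a trend emerge. It only works if the categorization of incidents is detailed and applied consistently.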
Further details of how detected trends should be handled
are included in the Continual Service Improvement
publication.
Regardless of the detection method, all the relevant detail
of the problem must be recorded so that a full historical
record exists. This must be date and time stamped to
allow suitable control and escalation.
A cross-reference must be made to the incident(s) which
initiated the Problem Record – and all relevant details
must be copied from the Incident Record(s) to the
Problem Record. It is difficult to be exact, as cases may
vary, but typically this will include details such as:
■ User details
■ Service details
■ Equipment details
■ Date/time initially logged
■ Priority and categorization details
■ Incident description
■ Details of all diagnostic or attempted recovery
actions taken.
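The cross-referencing and copying of details described above could be modelled roughly as follows. This is a sketch only: the class names, field names and the `raise_problem` helper are hypothetical and simply mirror the list of typical details. A real tool will have its own schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    # Field names are illustrative and follow the list of typical details.
    incident_id: str
    user: str
    service: str
    equipment: str
    logged_at: datetime
    priority: int
    category: str
    description: str
    actions_taken: list = field(default_factory=list)


@dataclass
class ProblemRecord:
    problem_id: str
    raised_at: datetime      # date/time stamp for control and escalation
    incident_ids: list       # cross-reference to the initiating incident(s)
    copied_details: list     # snapshot of the details of each incident


def raise_problem(problem_id, incidents, now=None):
    """Create a Problem Record that references and copies the given incidents."""
    raised_at = now or datetime.now(timezone.utc)
    return ProblemRecord(
        problem_id=problem_id,
        raised_at=raised_at,
        incident_ids=[i.incident_id for i in incidents],
        copied_details=[asdict(i) for i in incidents],
    )


inc = IncidentRecord(
    incident_id="INC-1",
    user="jsmith",
    service="Email",
    equipment="MAIL01",
    logged_at=datetime(2012, 1, 30, 9, 0, tzinfo=timezone.utc),
    priority=2,
    category="Email/Delivery",
    description="Outbound mail queued",
    actions_taken=["Restarted mail service"],
)
problem = raise_problem("PRB-1", [inc])
```

Copying a snapshot rather than holding only a reference means the Problem Record keeps the details as they stood when it was raised, even if the Incident Record is later updated.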
Problems must be prioritized in the same way and for the
same reasons as incidents – but the frequency and impact
of related incidents must also be taken into account. The
coding system described earlier in Table 4.1 (which
combines impact with urgency to give an overall priority
level) can be used to prioritize problems in the same way
that it might be used for incidents, though the definitions
and guidance to support staff on what constitutes a
problem, and the related service targets at each level,
must obviously be devised separately.
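A coding system that combines impact and urgency can be expressed as a simple lookup. The sketch below is illustrative only. The matrix values are an assumed example and are not a reproduction of Table 4.1. The rule that raises priority when many related incidents exist, and its threshold, are likewise hypothetical ways of taking frequency into account.

```python
# Illustrative impact/urgency matrix (1 = highest priority).
# These values are assumptions for the example, not Table 4.1.
PRIORITY = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}


def problem_priority(impact, urgency, related_incidents=0, frequency_threshold=5):
    """Return a priority level for a problem from its impact and urgency.

    Hypothetical rule: if the number of related incidents reaches the
    threshold, raise the priority by one level (never above level 1).
    """
    level = PRIORITY[(impact, urgency)]
    if related_incidents >= frequency_threshold:
        level = max(1, level - 1)
    return level


print(problem_priority("medium", "medium"))
print(problem_priority("medium", "medium", related_incidents=8))
```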
Problem prioritization should also take into account the
severity of the problems. Severity in this context refers to
how serious the problem is from an infrastructure
perspective, for example:
■ Can the system be recovered, or does it need to be
replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to
fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are
affected)?
Several analysis techniques can help in diagnosing
problems, including:
■ Chronological Analysis: When dealing with a difficult
problem, there are often conflicting reports about
exactly what has happened and when. It is therefore
very helpful to document all events briefly in
chronological order to provide a timeline of events.
This often makes it possible to see which events may
have been triggered by others, or to discount any
claims that are not supported by the sequence of
events.
■ Pain Value Analysis: This is where a broader view is
taken of the impact of an incident or problem, or
incident/problem type. Instead of just analysing the
number of incidents/problems of a particular type in a
particular period, a more in-depth analysis is done to
determine exactly what level of pain has been caused
to the organization/business by these
incidents/problems. A formula can be devised to
calculate this pain level. Typically this might include
taking into account:
● The number of people affected
● The duration of the downtime caused
● The cost to the business (if this can be readily
calculated or estimated).
By taking all of these factors into account, a much
more detailed picture of those incidents/problems or
incident/problem types that are causing most pain can
be determined – to allow a better focus on those
things that really matter and deserve highest priority
in resolving.
■ Kepner and Tregoe: Charles Kepner and Benjamin
Tregoe developed a useful method of problem analysis
which can be used formally to investigate
deeper-rooted problems. They defined the following stages:
● defining the problem
● describing the problem in terms of identity,
location, time and size
● establishing possible causes
● testing the most probable cause
● verifying the true cause.
The method is described in fuller detail in Appendix C.
■ Brainstorming: It can often be valuable to gather
together the relevant people, either physically or by
electronic means, and to ‘brainstorm’ the problem –
with people throwing in ideas on what the potential
cause may be and potential actions to resolve the
problem. Brainstorming sessions can be very
constructive and innovative but it is equally important
that someone, perhaps the Problem Manager,
documents the outcome and any agreed actions and
keeps a degree of control in the session(s).
■ Ishikawa Diagrams: Kaoru Ishikawa (1915–89), a
leader in Japanese quality control, developed a
method of documenting causes and effects which can
be useful in helping identify where something may be
going wrong, or be improved. Such a diagram is
typically the outcome of a brainstorming session
where problem solvers can offer suggestions. The
main goal is represented by the trunk of the diagram,
and primary factors are represented as branches.
Secondary factors are then added as stems, and so on.
Creating the diagram stimulates discussion and often
leads to increased understanding of a complex
problem. An example diagram is given in Appendix D.
■ Pareto Analysis: This is a technique for separating
important potential causes from more trivial issues.
The following steps should be taken:
1 Form a table listing the causes and their
frequency as a percentage.
2 Arrange the rows in decreasing order of
importance of the causes, i.e. the most important
cause first.
3 Add a cumulative percentage column to the
table. By this step, the table should look
something like Table 4.2, which illustrates 10
causes of network failure in an organization.
4 Create a bar chart with the causes, in order of
their percentage of total.
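The Pareto steps listed above can be sketched in Python as follows. The cause names and counts are invented for illustration and do not come from Table 4.2. The functions `pareto_table` and `text_bar_chart` are hypothetical helpers, and the text bars merely stand in for a real bar chart.

```python
def pareto_table(cause_counts):
    """Build rows of (cause, percentage, cumulative percentage).

    Rows are sorted with the most frequent cause first (steps 1 to 3).
    """
    total = sum(cause_counts.values())
    rows = []
    cumulative = 0.0
    ordered = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ordered:
        pct = 100.0 * count / total
        cumulative += pct
        rows.append((cause, round(pct, 1), round(cumulative, 1)))
    return rows


def text_bar_chart(rows, width=40):
    """Render one text bar per cause, scaled so 100% spans `width` characters (step 4)."""
    lines = []
    for cause, pct, _ in rows:
        bar = "#" * round(width * pct / 100)
        lines.append(f"{cause:<22} {bar} {pct}%")
    return "\n".join(lines)


# Hypothetical failure counts per cause.
failures = {
    "Cabling": 10,
    "Router firmware": 50,
    "DNS misconfiguration": 25,
    "Power loss": 15,
}

table = pareto_table(failures)
print(text_bar_chart(table))
```

The cumulative column shows how few causes account for most of the failures, which is the point of separating the important causes from the trivial ones.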
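For the Pain Value Analysis technique described earlier in this list, the text says only that a formula can be devised; it does not prescribe one. The sketch below shows one possible formula, chosen purely as an assumption: person-hours of downtime plus a weighted cost term. The function name, the weighting and the sample figures are all hypothetical.

```python
def pain_value(people_affected, downtime_hours, cost=0.0, cost_weight=0.0):
    """Return a pain score for an incident/problem type.

    Assumed formula: person-hours lost plus cost scaled by cost_weight.
    The weight converts a cost figure into the same scale as person-hours
    and should be agreed with the business. It defaults to 0, which
    ignores cost when it cannot be readily estimated.
    """
    return people_affected * downtime_hours + cost_weight * cost


# Hypothetical incident/problem types with invented figures.
pain_by_type = {
    "Email outage": pain_value(200, 2.0),
    "Printer fault": pain_value(10, 1.0, cost=500.0, cost_weight=0.5),
    "VPN failure": pain_value(50, 4.0),
}

# Rank types so that those causing most pain come first.
ranked = sorted(pain_by_type.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score}")
```

Ranking by a score of this kind, rather than by the raw number of incidents, can bring forward a rare but costly problem type that a simple count would hide.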


Monday, 30 January 2012
