Monday, 30 January 2012

Problem detection

It is likely that multiple ways of detecting problems will
exist in all organizations. These will include:
■ Suspicion or detection of an unknown cause of one or
more incidents by the Service Desk, resulting in a
Problem Record being raised – the desk may have
resolved the incident but has not determined a
definitive cause and suspects that it is likely to recur,
so will raise a Problem Record to allow the underlying
cause to be resolved. Alternatively, it may be
immediately obvious from the outset that an incident,
or incidents, has been caused by a major problem, so
a Problem Record will be raised without delay.
■ Analysis of an incident by a technical support group
which reveals that an underlying problem exists, or is
likely to exist.
■ Automated detection of an infrastructure or
application fault, using event/alert tools automatically
to raise an incident which may reveal the need for a
Problem Record.
■ A notification from a supplier or contractor that a
problem exists that has to be resolved.
■ Analysis of incidents as part of proactive Problem
Management – resulting in the need to raise a
Problem Record so that the underlying fault can be
investigated further.
Frequent and regular analysis of incident and problem
data must be performed to identify any trends as they
become discernible. This will require meaningful and
detailed categorization of incidents/problems and regular
reporting of patterns and areas of high occurrence. ‘Top
ten’ reporting, with drill-down capabilities to lower levels,
is useful in identifying trends.
Further details of how detected trends should be handled
are included in the Continual Service Improvement
publication.
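
As a rough illustration of 'top ten' reporting with drill-down, the sketch below counts incidents per category and then per subcategory within one category. The incident data, the two-level category scheme and the function names are assumptions made for this example only; a real Service Desk tool would supply its own export format.

```python
from collections import Counter

# Hypothetical incident log: (category, subcategory) pairs as they might be
# exported from a service desk tool.
incidents = [
    ("Network", "VPN"),
    ("Network", "VPN"),
    ("Network", "Switch"),
    ("Email", "Mailbox full"),
    ("Email", "Mailbox full"),
    ("Hardware", "Printer"),
]


def top_categories(records, n=10):
    """Return the n most frequent top-level categories with their counts."""
    return Counter(category for category, _ in records).most_common(n)


def drill_down(records, category, n=10):
    """Return the n most frequent subcategories within one category."""
    return Counter(sub for cat, sub in records if cat == category).most_common(n)


if __name__ == "__main__":
    for category, count in top_categories(incidents):
        print(category, count)
        for sub, sub_count in drill_down(incidents, category):
            print("   ", sub, sub_count)
```

Running the same report at regular intervals and comparing the results is one simple way to see a trend emerging.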
Regardless of the detection method, all the relevant detail
of the problem must be recorded so that a full historic
record exists. This must be date and time stamped to
allow suitable control and escalation.
A cross-reference must be made to the incident(s) which
initiated the Problem Record – and all relevant details
must be copied from the Incident Record(s) to the
Problem Record. It is difficult to be exact, as cases may
vary, but typically this will include details such as:
■ User details
■ Service details
■ Equipment details
■ Date/time initially logged
■ Priority and categorization details
■ Incident description
■ Details of all diagnostic or attempted recovery
actions taken.
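
The cross-referencing described above can be sketched as a pair of simple record types. This is only an illustration: the field names, the identifiers such as "INC-1" and the raise_problem helper are hypothetical and are not taken from any particular tool.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    incident_id: str
    user: str
    service: str
    equipment: str
    logged_at: datetime          # date/time initially logged
    priority: int
    category: str
    description: str
    actions_taken: list = field(default_factory=list)


@dataclass
class ProblemRecord:
    problem_id: str
    raised_at: datetime          # date/time stamp for control and escalation
    incident_ids: list           # cross-reference to the initiating incidents
    copied_details: list         # details copied from each Incident Record


def raise_problem(problem_id, incidents, now=None):
    """Create a Problem Record cross-referenced to its initiating incidents."""
    now = now or datetime.now(timezone.utc)
    return ProblemRecord(
        problem_id=problem_id,
        raised_at=now,
        incident_ids=[i.incident_id for i in incidents],
        # asdict() makes a deep copy, so later edits to the incident
        # do not silently change the problem's historic record.
        copied_details=[asdict(i) for i in incidents],
    )
```

Copying the details, rather than only linking to them, keeps the Problem Record readable on its own even if the Incident Record is later amended.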
Problems must be prioritized in the same way and for the
same reasons as incidents – but the frequency and impact
of related incidents must also be taken into account. The
coding system described earlier in Table 4.1 (which
combines impact with urgency to give an overall priority
level) can be used to prioritize problems in the same way
that it might be used for incidents, though the definitions
and guidance to support staff on what constitutes a
problem, and the related service targets at each level,
must obviously be devised separately.
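
A minimal sketch of combining impact with urgency into one priority level follows. Table 4.1 is not reproduced in this text, so the three-level scale and the resulting values from 1 (highest) to 5 (lowest) are assumptions chosen for illustration; an organization would substitute its own coding system.

```python
# Assumed three-level scale; 1 is the most severe rank.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Combine impact and urgency into a priority from 1 (highest) to 5 (lowest)."""
    return LEVELS[impact] + LEVELS[urgency] - 1


if __name__ == "__main__":
    for impact in LEVELS:
        for urgency in LEVELS:
            print(f"impact={impact:<6} urgency={urgency:<6} priority={priority(impact, urgency)}")
```

For problems, the impact rating would also reflect the frequency and impact of the related incidents, as noted above.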
Problem prioritization should also take into account the
severity of the problems. Severity in this context refers to
how serious the problem is from an infrastructure
perspective, for example:
■ Can the system be recovered, or does it need to be
replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to
fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are
affected)?
■ Chronological Analysis : When dealing with a difficult
problem, there are often conflicting reports about
exactly what has happened and when. It is therefore
very helpful briefly to document all events in
chronological order – to provide a timeline of events.
This often makes it possible to see which events may
have been triggered by others – or to discount any
claims that are not supported by the sequence of
events.
■ Pain Value Analysis : This is where a broader view is
taken of the impact of an incident or problem, or
incident/problem type. Instead of just analysing the
number of incidents/problems of a particular type in a
particular period, a more in-depth analysis is done to
determine exactly what level of pain has been caused
to the organization/business by these
incidents/problems. A formula can be devised to
calculate this pain level. Typically this might include
taking into account:
● The number of people affected
● The duration of the downtime caused
● The cost to the business (if this can be readily
calculated or estimated).
By taking all of these factors into account, a much
more detailed picture of those incidents/problems or
incident/problem types that are causing most pain can
be determined – to allow a better focus on those
things that really matter and deserve highest priority
in resolving.
■ Kepner and Tregoe: Charles Kepner and Benjamin
Tregoe developed a useful way of problem analysis
which can be used formally to investigate deeper-
rooted problems. They defined the following stages:
● defining the problem
● describing the problem in terms of identity,
location, time and size
● establishing possible causes
● testing the most probable cause
● verifying the true cause.
The method is described in fuller detail in  Appendix C.
■ Brainstorming: It can often be valuable to gather
together the relevant people, either physically or by
electronic means, and to ‘brainstorm’ the problem –
with people throwing in ideas on what the potential
cause may be and potential actions to resolve the
problem. Brainstorming sessions can be very
constructive and innovative but it is equally important
that someone, perhaps the Problem Manager,
documents the outcome and any agreed actions and
keeps a degree of control in the session(s).
■ Ishikawa Diagrams : Kaoru Ishikawa (1915–89), a
leader in Japanese quality control, developed a
method of documenting causes and effects which can
be useful in helping identify where something may be
going wrong, or be improved. Such a diagram is
typically the outcome of a brainstorming session
where problem solvers can offer suggestions. The
main goal is represented by the trunk of the diagram,
and primary factors are represented as branches.
Secondary factors are then added as stems, and so on.
Creating the diagram stimulates discussion and often
leads to increased understanding of a complex
problem. An example diagram is given in Appendix D.
■ Pareto Analysis : This is a technique for separating
important potential causes from more trivial issues.
The following steps should be taken:
1 Form a table listing the causes and their
frequency as a percentage.
2 Arrange the rows in the decreasing order of
importance of the causes, i.e. the most important
cause first.
3 Add a cumulative percentage column to the
table. By this step, the chart should look
something like Table 4.2, which illustrates 10
causes of network failure in an organization.
4 Create a bar chart with the causes, in order of
their percentage of total.
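
The first three Pareto steps above can be sketched in a few lines. Table 4.2 is not reproduced here, so the cause names and counts below are invented purely for illustration, and pareto_table is a hypothetical helper name.

```python
def pareto_table(cause_counts):
    """Return rows of (cause, percentage, cumulative percentage), largest first."""
    total = sum(cause_counts.values())
    rows = []
    running = 0
    # Step 2: arrange causes in decreasing order of frequency.
    for cause, count in sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        # Steps 1 and 3: frequency as a percentage, plus a cumulative column.
        rows.append((cause, round(100.0 * count / total, 1), round(100.0 * running / total, 1)))
    return rows


# Invented example data: number of network failures attributed to each cause.
example = {"Cabling fault": 50, "Addressing conflict": 15, "Router firmware": 30, "Other": 5}

if __name__ == "__main__":
    for cause, pct, cumulative in pareto_table(example):
        # Step 4, approximated as a text bar chart: one '#' per two percent.
        print(f"{cause:<20} {pct:>5}% {cumulative:>6}%  {'#' * int(pct // 2)}")
```

Reading down the cumulative column shows how few causes account for most of the failures, which is the point of the technique.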

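For the Pain Value Analysis described above, one possible formula is sketched below. The text only says that a formula can be devised, so the way the three factors are combined here (people affected multiplied by downtime hours, plus a weighted business cost) and the sample figures are assumptions for illustration, not a prescribed method.

```python
from collections import defaultdict


def pain_value(people_affected, downtime_hours, business_cost=0.0, cost_weight=1.0):
    """Hypothetical pain formula: person-hours lost plus weighted business cost."""
    return people_affected * downtime_hours + cost_weight * business_cost


def rank_by_pain(incidents):
    """Sum pain per incident type and return the types ordered by most pain first."""
    totals = defaultdict(float)
    for incident_type, people, hours, cost in incidents:
        totals[incident_type] += pain_value(people, hours, cost)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented data: (incident type, people affected, downtime hours, business cost).
sample = [
    ("Printer jam", 2, 0.5, 0.0),
    ("Printer jam", 2, 0.5, 0.0),
    ("Printer jam", 2, 0.5, 0.0),
    ("ERP outage", 200, 2.0, 5000.0),
]
```

In this invented sample the printer jams are three times as frequent, yet the single ERP outage dominates the ranking, which is the distinction a simple incident count would miss.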

Service Operation processes

Error messaging is important for all components
(hardware, software, networks, etc.). It is particularly
important that all software applications are designed to
support Event Management. This might include the
provision of meaningful error messages and/or codes that
clearly indicate the specific point of failure and the most
likely cause. In such cases the testing of new applications
should include testing of accurate event generation.
Newer technologies such as Java Management Extensions
(JMX) or HawkNL™ provide the tools for building
distributed, web-based, modular and dynamic solutions for
managing and monitoring devices, applications and
service-driven networks. These can be used to reduce or
eliminate the need for programmers to include error
messaging within the code – allowing a valuable level of
normalization and code-independence.
Good Event Management design will also include the
design and population of the tools used to filter, correlate
and escalate Events.
The Correlation Engine specifically will need to be
populated with the rules and criteria that will determine
the significance and subsequent action for each type
of event.
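
As a sketch of what populating a Correlation Engine with rules might look like, the example below maps each event to a significance and an action. The event fields, the rule names, the three significance labels and the action names are all assumptions for illustration; real event management tools define their own rule formats.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]   # criteria deciding whether the rule applies
    significance: str                 # assumed labels: informational, warning, exception
    action: str                       # assumed action name handed to downstream tooling


# Hypothetical rule set; the first matching rule wins.
RULES = [
    Rule("service down",
         lambda e: e.get("type") == "service_state" and e.get("value") == "down",
         "exception", "raise_incident"),
    Rule("disk nearly full",
         lambda e: e.get("type") == "disk_usage" and e.get("value", 0) >= 90,
         "warning", "alert_operations"),
]


def correlate(event, rules=RULES):
    """Return (significance, action) for an event, defaulting to logging only."""
    for rule in rules:
        if rule.matches(event):
            return rule.significance, rule.action
    return "informational", "log_only"
```

Keeping the rules as data rather than code scattered through applications makes them easier to review and tune over time.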
Thorough design of the event detection and alert
mechanisms requires the following:
■ Business knowledge in relationship to any business
processes being managed via Event Management
■ Detailed knowledge of the Service Level Requirements
of the service being supported by each CI
■ Knowledge of who is going to be supporting the CI
■ Knowledge of what constitutes normal and abnormal
operation of the CI
■ Knowledge of the significance of multiple similar
events (on the same CI or various similar CIs)
■ An understanding of what they need to know to
support the CI effectively
■ Information that can help in the diagnosis of problems
with the CI
■ Familiarity with incident prioritization and
categorization codes so that if it is necessary to create
an Incident Record, these codes can be provided
■ Knowledge of other CIs that may be dependent on
the affected CI, or those CIs on which it depends
■ Availability of Known Error information from vendors
or from previous experience.
4.1.10.4 Identification of thresholds
Thresholds themselves are not set and managed through
Event Management. However, unless these are properly
designed and communicated during the instrumentation
process, it will be difficult to determine which level of
performance is appropriate for each CI.
Also, most thresholds are not constant. They typically
consist of a number of related variables. For example, the
maximum number of concurrent users before response
time slows will vary depending on what other jobs are
active on the server. This knowledge is often only gained
by experience, which means that Correlation Engines have
to be continually tuned and updated through the process
of Continual Service Improvement.
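
The point that a threshold depends on related variables can be illustrated with a small sketch. The linear relationship and every number in it (base limit, users lost per active job, floor) are invented for the example; in practice such values come from experience and are retuned over time.

```python
def max_concurrent_users(active_batch_jobs, base_limit=500, users_lost_per_job=50, floor=100):
    """Assumed model: each active batch job lowers the user threshold, down to a floor."""
    return max(floor, base_limit - users_lost_per_job * active_batch_jobs)


def breaches_threshold(current_users, active_batch_jobs):
    """True if the current user count exceeds the threshold for the present workload."""
    return current_users > max_concurrent_users(active_batch_jobs)


if __name__ == "__main__":
    # The same user count is acceptable on an idle server but not on a busy one.
    print(breaches_threshold(450, active_batch_jobs=0))
    print(breaches_threshold(450, active_batch_jobs=2))
```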
4.2 INCIDENT MANAGEMENT
4.2.1 Purpose/goal/objective
The primary goal of the Incident Management process is
to restore normal service operation as quickly as possible
and minimize the adverse impact on business operations,
thus ensuring that the best possible levels of service
quality and availability are maintained. ‘Normal service
operation’ is defined here as service operation within
SLA limits.
The value of Incident Management includes:
■ The ability to detect and resolve incidents, which
results in lower downtime to the business, which in
turn means higher availability of the service. This
means that the business is able to exploit the
functionality of the service as designed.
■ The ability to align IT activity to real-time business
priorities. This is because Incident Management
includes the capability to identify business priorities
and dynamically allocate resources as necessary.
■ The ability to identify potential improvements to
services. This happens as a result of understanding
what constitutes an incident and also from being in
contact with the activities of business operational staff.
■ The Service Desk can, during its handling of incidents,
identify additional service or training requirements
found in IT or the business.
Incident Management is highly visible to the business, and
it is therefore easier to demonstrate its value than most
areas in Service Operation. For this reason, Incident
Management is often one of the first processes to be
implemented in Service Management projects. The added
benefit of doing this is that Incident Management can be
used to highlight other areas that need attention –
thereby providing a justification for expenditure on
implementing other processes.

Sunday, 15 January 2012

CSI approach

As the above figure shows, there are many opportunities
for CSI. The figure above also illustrates a constant cycle
for improvement. The improvement process can be
summarized in six steps:
■ Embrace the vision by understanding the high-level
business objectives. The vision should align the
business and IT strategies.
■ Assess the current situation to obtain an accurate,
unbiased snapshot of where the organization is right
now. This baseline assessment is an analysis of the
current position in terms of the business, organization,
people, process and technology.
■ Understand and agree on the priorities for
improvement based on a deeper development of the
principles defined in the vision. The full vision may be
years away but this step provides specific goals and a
manageable timeframe.
■ Detail the CSI plan to achieve higher quality service
provision by implementing ITSM processes.
■ Verify that measurements and metrics are in place to
ensure that milestones were achieved, process
compliance is high, and business objectives and
priorities were met by the level of service.
■ Finally, the process should ensure that the momentum
for quality improvement is maintained by assuring that
changes become embedded in the organization.

Since CSI involves ongoing change, it is important to
develop an effective communication strategy to support
CSI activities – ensuring people remain appropriately
informed. This communication must include aspects of
what the service implications are, what the impact on the
personnel is and the approach or process used to reach
the objective. In the absence of truth, people will fill in the
gap with their own truth.
Perception will play a key role in determining the success
of any CSI initiative. Proper reporting should assist in
addressing the misconceptions about the improvements. It
is important to understand why there are differences in
perception between the customer and the service
provider. Figure 2.4 identifies the most obvious potential
gaps in the service lifecycle from both a business and an
IT perspective:
Service Level Management has the task of ensuring that
potential gaps are managed and, when there is a gap,
of identifying whether a Service Improvement Plan (SIP)
is needed. Often a large gap exists between what the
customer wants, what they actually need, and what they
are willing to pay for. Add to this the fact that IT will often
try to define and deliver what they ‘think’ the customer
wants. As a result, it is not surprising that there is a
perception and delivery gap between the Customer and
IT.