Continual Service Improvement

Monday, 30 January 2012

Problem detection

It is likely that multiple ways of detecting problems will
exist in all organizations. These will include:
■ Suspicion or detection of an unknown cause of one or
more incidents by the Service Desk, resulting in a
Problem Record being raised – the desk may have
resolved the incident but has not determined a
definitive cause and suspects that it is likely to recur,
so will raise a Problem Record to allow the underlying
cause to be resolved. Alternatively, it may be
immediately obvious from the outset that an incident,
or incidents, has been caused by a major problem, so
a Problem Record will be raised without delay.
■ Analysis of an incident by a technical support group
which reveals that an underlying problem exists, or is
likely to exist.
■ Automated detection of an infrastructure or
application fault, using event/alert tools automatically
to raise an incident which may reveal the need for a
Problem Record.
■ A notification from a supplier or contractor that a
problem exists that has to be resolved.
■ Analysis of incidents as part of proactive Problem
Management – resulting in the need to raise a
Problem Record so that the underlying fault can be
investigated further.
Frequent and regular analysis of incident and problem
data must be performed to identify any trends as they
become discernible. This will require meaningful and
detailed categorization of incidents/problems and regular
reporting of patterns and areas of high occurrence. ‘Top
ten’ reporting, with drill-down capabilities to lower levels,
is useful in identifying trends.
Further details of how detected trends should be handled
are included in the Continual Service Improvement
publication.
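As a rough illustration of ‘top ten’ reporting with drill-down, the sketch below counts incidents per category and then per subcategory within one category. The record layout, the sample data and the function names `top_n` and `drill_down` are assumptions made for this example and are not part of the guidance above.

```python
from collections import Counter

# Hypothetical incident records as (category, subcategory) pairs.
incidents = [
    ("Network", "Switch failure"),
    ("Network", "DNS"),
    ("Email", "Mailbox full"),
    ("Network", "Switch failure"),
    ("Email", "Mailbox full"),
    ("Application", "Login error"),
    ("Network", "DNS"),
    ("Network", "Switch failure"),
]


def top_n(records, n=10):
    """Return the n most frequent top-level categories with their counts."""
    return Counter(category for category, _ in records).most_common(n)


def drill_down(records, category, n=10):
    """Return the n most frequent subcategories within one category."""
    counts = Counter(sub for cat, sub in records if cat == category)
    return counts.most_common(n)


print(top_n(incidents))
print(drill_down(incidents, "Network"))
```

Running the same report over successive periods and comparing the counts is one simple way to see a trend emerge.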
Regardless of the detection method, all the relevant detail
of the problem must be recorded so that a full historic
record exists. This must be date and time stamped to
allow suitable control and escalation.
A cross-reference must be made to the incident(s) which
initiated the Problem Record – and all relevant details
must be copied from the Incident Record(s) to the
Problem Record. It is difficult to be exact, as cases may
vary, but typically this will include details such as:
■ User details
■ Service details
■ Equipment details
■ Date/time initially logged
■ Priority and categorization details
■ Incident description
■ Details of all diagnostic or attempted recovery
actions taken.
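A minimal sketch of how a Problem Record might cross-reference its initiating incidents and copy their details, with a date and time stamp. The class names, field names and the `raise_problem` helper are hypothetical and chosen only to mirror the list above; a real Service Management tool will have its own schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    incident_id: str
    user: str
    service: str
    equipment: str
    logged_at: datetime
    priority: int
    category: str
    description: str
    actions_taken: list = field(default_factory=list)


@dataclass
class ProblemRecord:
    problem_id: str
    raised_at: datetime    # date/time stamp for control and escalation
    incident_ids: list     # cross-references to the initiating incidents
    copied_details: list   # details copied from each Incident Record


def raise_problem(problem_id, incidents, now=None):
    """Create a Problem Record from one or more Incident Records."""
    now = now or datetime.now(timezone.utc)
    return ProblemRecord(
        problem_id=problem_id,
        raised_at=now,
        incident_ids=[i.incident_id for i in incidents],
        # asdict() makes an independent copy, so later edits to the
        # incident do not silently change the problem's historic record.
        copied_details=[asdict(i) for i in incidents],
    )
```

Copying the details, instead of only linking to the incident, keeps the full historic record available even if the incident is later amended.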
Problems must be prioritized in the same way and for the
same reasons as incidents – but the frequency and impact
of related incidents must also be taken into account. The
coding system described earlier in Table 4.1 (which
combines impact with urgency to give an overall priority
level) can be used to prioritize problems in the same way
that it might be used for incidents, though the definitions
and guidance to support staff on what constitutes a
problem, and the related service targets at each level,
must obviously be devised separately.
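The idea of combining impact with urgency can be sketched as a simple lookup. Table 4.1 is not reproduced here, so the three-level scale and the resulting priority codes 1 to 5 below are illustrative assumptions, not the published table.

```python
# Assumed three-level scale; 1 is the most severe rank.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Combine impact and urgency into a priority code from 1 (top) to 5."""
    return LEVELS[impact] + LEVELS[urgency] - 1


for impact in LEVELS:
    print(impact, [priority(impact, urgency) for urgency in LEVELS])
```

For problems, the frequency and impact of related incidents would feed into the choice of impact level; how that is done is left to the organization's own guidance.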
Problem prioritization should also take into account the
severity of the problems. Severity in this context refers to
how serious the problem is from an infrastructure
perspective, for example:
■ Can the system be recovered, or does it need to be
replaced?
■ How much will it cost?
■ How many people, with what skills, will be needed to
fix the problem?
■ How long will it take to fix the problem?
■ How extensive is the problem (e.g. how many CIs are
affected)?
A number of techniques can be used to analyse and diagnose
problems, including:
■ Chronological Analysis: When dealing with a difficult
problem, there are often conflicting reports about
exactly what has happened and when. It is therefore
very helpful to document all events briefly in
chronological order – to provide a timeline of events.
This often makes it possible to see which events may
have been triggered by others – or to discount any
claims that are not supported by the sequence of
events.
■ Pain Value Analysis: This is where a broader view is
taken of the impact of an incident or problem, or
incident/problem type. Instead of just analysing the
number of incidents/problems of a particular type in a
particular period, a more in-depth analysis is done to
determine exactly what level of pain has been caused
to the organization/business by these
incidents/problems. A formula can be devised to
calculate this pain level. Typically this might include
taking into account:
● The number of people affected
● The duration of the downtime caused
● The cost to the business (if this can be readily
calculated or estimated).
By taking all of these factors into account, a much
more detailed picture of those incidents/problems or
incident/problem types that are causing most pain can
be determined – to allow a better focus on those
things that really matter and deserve highest priority
in resolving.
■ Kepner and Tregoe: Charles Kepner and Benjamin
Tregoe developed a useful method of problem analysis
which can be used formally to investigate deeper-rooted
problems. They defined the following stages:
● defining the problem
● describing the problem in terms of identity,
location, time and size
● establishing possible causes
● testing the most probable cause
● verifying the true cause.
The method is described in fuller detail in Appendix C.
■ Brainstorming: It can often be valuable to gather
together the relevant people, either physically or by
electronic means, and to ‘brainstorm’ the problem –
with people throwing in ideas on what the potential
cause may be and potential actions to resolve the
problem. Brainstorming sessions can be very
constructive and innovative but it is equally important
that someone, perhaps the Problem Manager,
documents the outcome and any agreed actions and
keeps a degree of control in the session(s).
■ Ishikawa Diagrams: Kaoru Ishikawa (1915–89), a
leader in Japanese quality control, developed a
method of documenting causes and effects which can
be useful in helping identify where something may be
going wrong, or be improved. Such a diagram is
typically the outcome of a brainstorming session
where problem solvers can offer suggestions. The
main goal is represented by the trunk of the diagram,
and primary factors are represented as branches.
Secondary factors are then added as stems, and so on.
Creating the diagram stimulates discussion and often
leads to increased understanding of a complex
problem. An example diagram is given in Appendix D.
■ Pareto Analysis: This is a technique for separating
important potential causes from more trivial issues.
The following steps should be taken:
1 Form a table listing the causes and their
frequency as a percentage.
2 Arrange the rows in the decreasing order of
importance of the causes, i.e. the most important
cause first.
3 Add a cumulative percentage column to the
table. By this step, the chart should look
something like Table 4.2, which illustrates 10
causes of network failure in an organization.
4 Create a bar chart with the causes, in order of
their percentage of total.
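Two of the techniques above lend themselves to a short worked sketch: a pain value formula and the four Pareto steps. Everything specific here is an assumption for illustration: the additive weighted form of `pain_value`, its default weights, and the cause names and counts, which do not reproduce Table 4.2.

```python
def pain_value(people_affected, downtime_hours, cost, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three factors; form and weights are assumptions."""
    w_people, w_time, w_cost = weights
    return w_people * people_affected + w_time * downtime_hours + w_cost * cost


def pareto_table(cause_counts):
    """Steps 1-3: percentage per cause, sorted descending, cumulative column."""
    total = sum(cause_counts.values())
    rows = []
    cumulative = 0.0
    ordered = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ordered:
        pct = 100.0 * count / total
        cumulative += pct
        rows.append((cause, round(pct, 1), round(cumulative, 1)))
    return rows


# Hypothetical failure counts per cause.
causes = {"Cause A": 10, "Cause B": 50, "Cause C": 25, "Cause D": 15}

# Step 4: a plain-text bar chart, one '#' per 5 percentage points.
for cause, pct, cum in pareto_table(causes):
    print(f"{cause:<8} {'#' * round(pct / 5):<12} {pct:5.1f}%  cum {cum:5.1f}%")
```

The cumulative column makes it easy to see how few causes account for most failures. The pain value weights would need to be agreed with the business, since they trade people, time and money against each other.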

Service Operation processes

Error messaging is important for all components
(hardware, software, networks, etc.). It is particularly
important that all software applications are designed to
support Event Management. This might include the
provision of meaningful error messages and/or codes that
clearly indicate the specific point of failure and the most
likely cause. In such cases the testing of new applications
should include testing of accurate event generation.
Newer technologies such as Java Management Extensions
(JMX) or HawkNL™ provide the tools for building
distributed, web-based, modular and dynamic solutions for
managing and monitoring devices, applications and
service-driven networks. These can be used to reduce or
eliminate the need for programmers to include error
messaging within the code – allowing a valuable level of
normalization and code-independence.
Good Event Management design will also include the
design and population of the tools used to filter, correlate
and escalate Events.
The Correlation Engine specifically will need to be
populated with the rules and criteria that will determine
the significance and subsequent action for each type
of event.
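The rules and criteria held by a Correlation Engine can be pictured as an ordered list of predicates, each mapped to a significance and an action. The sketch below is a toy model under that assumption; the `Event` and `Rule` classes, the significance labels, the action names and the disk usage figures are all hypothetical and do not describe any particular tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Event:
    ci: str        # Configuration Item that produced the event
    kind: str      # type of measurement or message
    value: float


@dataclass
class Rule:
    name: str
    matches: Callable[[Event], bool]
    significance: str   # assumed labels: informational, warning, exception
    action: str


# Rules are checked in order, so the most severe condition comes first.
RULES = [
    Rule("disk-critical",
         lambda e: e.kind == "disk_used_pct" and e.value >= 95,
         "exception", "raise_incident"),
    Rule("disk-warning",
         lambda e: e.kind == "disk_used_pct" and e.value >= 80,
         "warning", "alert_operator"),
]


def correlate(event, rules=RULES):
    """Return (significance, action) for the first matching rule."""
    for rule in rules:
        if rule.matches(event):
            return rule.significance, rule.action
    return "informational", "log_only"


print(correlate(Event("srv-01", "disk_used_pct", 97.0)))
print(correlate(Event("srv-01", "disk_used_pct", 85.0)))
print(correlate(Event("srv-01", "cpu_pct", 40.0)))
```

Keeping the rules as data, separate from the matching loop, is what allows them to be populated and tuned without changing the engine itself.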
Thorough design of the event detection and alert
mechanisms requires the following:
■ Business knowledge in relationship to any business
processes being managed via Event Management
■ Detailed knowledge of the Service Level Requirements
of the service being supported by each CI
■ Knowledge of who is going to be supporting the CI
■ Knowledge of what constitutes normal and abnormal
operation of the CI
■ Knowledge of the significance of multiple similar
events (on the same CI or various similar CIs)
■ An understanding of what they need to know to
support the CI effectively
■ Information that can help in the diagnosis of problems
with the CI
■ Familiarity with incident prioritization and
categorization codes so that if it is necessary to create
an Incident Record, these codes can be provided
■ Knowledge of other CIs that may be dependent on
the affected CI, or those CIs on which it depends
■ Availability of Known Error information from vendors
or from previous experience.
4.1.10.4 Identification of thresholds
Thresholds themselves are not set and managed through
Event Management. However, unless these are properly
designed and communicated during the instrumentation
process, it will be difficult to determine which level of
performance is appropriate for each CI.
Also, most thresholds are not constant. They typically
consist of a number of related variables. For example, the
maximum number of concurrent users before response
time slows will vary depending on what other jobs are
active on the server. This knowledge is often only gained
by experience, which means that Correlation Engines have
to be continually tuned and updated through the process
of Continual Service Improvement.
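The concurrent-users example can be sketched as a threshold that is computed from other variables instead of being a fixed number. The linear model and every figure below (`base_limit`, `cost_per_job`, `floor`) are invented for illustration; in practice such values come from experience and are retuned over time.

```python
def max_concurrent_users(active_jobs, base_limit=500, cost_per_job=50, floor=100):
    """Estimated user limit before response time slows (assumed linear model)."""
    return max(floor, base_limit - cost_per_job * active_jobs)


def breaches_threshold(current_users, active_jobs):
    """True when the current load exceeds the limit for this job mix."""
    return current_users > max_concurrent_users(active_jobs)


# The same user count is acceptable on an idle server
# but breaches the threshold when four other jobs are active.
print(breaches_threshold(350, active_jobs=0))
print(breaches_threshold(350, active_jobs=4))
```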
4.2 INCIDENT MANAGEMENT
4.2.1 Purpose/goal/objective
The primary goal of the Incident Management process is
to restore normal service operation as quickly as possible
and minimize the adverse impact on business operations,
thus ensuring that the best possible levels of service
quality and availability are maintained. ‘Normal service
operation’ is defined here as service operation within
SLA limits.
The value of Incident Management includes:
■ The ability to detect and resolve incidents, which
results in lower downtime to the business, which in
turn means higher availability of the service. This
means that the business is able to exploit the
functionality of the service as designed.
■ The ability to align IT activity to real-time business
priorities. This is because Incident Management
includes the capability to identify business priorities
and dynamically allocate resources as necessary.
■ The ability to identify potential improvements to
services. This happens as a result of understanding
what constitutes an incident and also from being in
contact with the activities of business operational staff.
■ The Service Desk can, during its handling of incidents,
identify additional service or training requirements
found in IT or the business.
Incident Management is highly visible to the business, and
it is therefore easier to demonstrate its value than most
areas in Service Operation. For this reason, Incident
Management is often one of the first processes to be
implemented in Service Management projects. The added
benefit of doing this is that Incident Management can be
used to highlight other areas that need attention –
thereby providing a justification for expenditure on
implementing other processes.

Sunday, 15 January 2012

CSI approach

As the above figure shows, there are many opportunities
for CSI. The figure above also illustrates a constant cycle
for improvement. The improvement process can be
summarized in six steps:
■ Embrace the vision by understanding the high-level
business objectives. The vision should align the
business and IT strategies.
■ Assess the current situation to obtain an accurate,
unbiased snapshot of where the organization is right
now. This baseline assessment is an analysis of the
current position in terms of the business, organization,
people, process and technology.
■ Understand and agree on the priorities for
improvement based on a deeper development of the
principles defined in the vision. The full vision may be
years away but this step provides specific goals and a
manageable timeframe.
■ Detail the CSI plan to achieve higher quality service
provision by implementing ITSM processes.
■ Verify that measurements and metrics are in place to
ensure that milestones were achieved, process
compliance is high, and business objectives and
priorities were met by the level of service.
■ Finally, the process should ensure that the momentum
for quality improvement is maintained by assuring that
changes become embedded in the organization.

Since CSI involves ongoing change, it is important to
develop an effective communication strategy to support
CSI activities – ensuring people remain appropriately
informed. This communication must include aspects of
what the service implications are, what the impact on the
personnel is and the approach or process used to reach
the objective. In the absence of truth, people will fill in the
gap with their own truth.
Perception will play a key role in determining the success
of any CSI initiative. Proper reporting should assist in
addressing the misconceptions about the improvements. It
is important to understand why there are differences in
perception between the customer and the service
provider. Figure 2.4 identifies the most obvious potential
gaps in the service lifecycle from both a business and an
IT perspective:
Service Level Management has the task of ensuring that
potential gaps are managed and, when there is a gap,
identifying whether a Service Improvement Plan (SIP) is
needed. Often a large gap exists between what the
customer wants, what they actually need, and what they
are willing to pay for. Add to this the fact that IT will often
try to define and deliver what it ‘thinks’ the customer
wants. As a result, it is not surprising that there is a
perception and delivery gap between the Customer and
IT.
