Today, some of the main challenges in NOC management, described in the
following diagram, are:
Troubleshooting billions of service alarms.
Processing around 20 million workflow management notifications by NOC
experts.
Managing millions of call center emails.
Higher costs due to low use of workflow management.
Incident management is an area where we already use rule-based expert
systems. However, the constantly evolving nature of networks, both in
technology and in deployment, makes it very difficult to maintain handwritten
rules in such systems. Automated incident management that is
domain-independent and data-driven, without the need for specific rules,
would greatly enhance the automation of NOCs. For example, a failure on one
node can cause cascading failures on other nodes, resulting in a series of
alarms. Machine learning techniques allow us to discover co-occurring
patterns in a sequence of alarms and other events, so that we can quickly
identify the root cause in most failure scenarios. This frees up the NOC team
to focus on more complex challenges.
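As a rough illustration of what discovering co-occurring alarm patterns can look like, the sketch below counts pairs of alarm types that fire within a short time window of each other. It is not the method used in our work, only a minimal stand-in: the `(timestamp, alarm_type)` event schema, the function name `cooccurring_pairs`, and the `window_s` and `min_support` parameters are all assumptions made for this example.

```python
from collections import Counter


def cooccurring_pairs(events, window_s=60, min_support=2):
    """Count alarm-type pairs that fire within window_s seconds of each other.

    events: iterable of (timestamp_seconds, alarm_type) tuples (hypothetical schema).
    Returns {(type_a, type_b): count} for pairs seen at least min_support times.
    """
    events = sorted(events)
    counts = Counter()
    start = 0
    for i, (t, alarm_type) in enumerate(events):
        # Slide the window start forward until it is within window_s of t.
        while events[start][0] < t - window_s:
            start += 1
        # Distinct other alarm types seen earlier in the current window.
        earlier = {events[j][1] for j in range(start, i)} - {alarm_type}
        for other in earlier:
            counts[tuple(sorted((alarm_type, other)))] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Illustrative, made-up alarm stream.
history = [
    (0, "LINK_DOWN"), (5, "BGP_DROP"),
    (300, "LINK_DOWN"), (310, "BGP_DROP"), (320, "FAN_FAIL"),
    (900, "FAN_FAIL"),
]
pairs = cooccurring_pairs(history)
```

Pairs that recur often, such as a link failure followed by a routing session drop, are candidates for a cascading-failure pattern. A real system would also need to account for topology and for pairs that co-occur by chance.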
What kind of complexity does this imply?
Typical handling of NOC alarms involves mapping the received alarms to
incidents using enrichment, aggregation, deduplication, and correlation
techniques. This is a challenge because the alarm information is
heterogeneous: current telecommunication networks combine solutions from
many technologies and many vendors. This heterogeneity makes
it difficult to create a harmonized view of the system and greatly increases
the complexity associated with fault detection and resolution.
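To make the heterogeneity problem concrete, here is a minimal sketch of two of the steps named above: mapping vendor-specific alarm records onto one common schema, then deduplicating repeats. The vendor names, the field names in `FIELD_MAPS`, the `Alarm` class, and the `holdoff_s` parameter are all hypothetical and do not come from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    node: str
    alarm_type: str
    timestamp: float  # seconds


# Hypothetical per-vendor field names; real feeds differ far more than this.
FIELD_MAPS = {
    "vendor_a": {"node": "ne_id", "alarm_type": "alarmName", "timestamp": "ts"},
    "vendor_b": {"node": "hostname", "alarm_type": "event", "timestamp": "time"},
}


def normalize(vendor, raw):
    """Map one vendor-specific record onto the common Alarm schema."""
    fields = FIELD_MAPS[vendor]
    return Alarm(
        node=str(raw[fields["node"]]).lower(),
        alarm_type=str(raw[fields["alarm_type"]]).upper().replace(" ", "_"),
        timestamp=float(raw[fields["timestamp"]]),
    )


def deduplicate(alarms, holdoff_s=30):
    """Drop repeats of the same (node, alarm_type) within holdoff_s of the last kept one."""
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.node, alarm.alarm_type)
        if key in last_kept and alarm.timestamp - last_kept[key] < holdoff_s:
            continue
        last_kept[key] = alarm.timestamp
        kept.append(alarm)
    return kept


# Illustrative, made-up records from two feeds describing the same node.
raw_feed = [
    ("vendor_a", {"ne_id": "RNC1", "alarmName": "link down", "ts": 100}),
    ("vendor_b", {"hostname": "rnc1", "event": "LINK_DOWN", "time": "110"}),
    ("vendor_b", {"hostname": "rnc1", "event": "LINK_DOWN", "time": "200"}),
]
unique = deduplicate(normalize(vendor, raw) for vendor, raw in raw_feed)
```

Every new vendor or alarm format needs another hand-maintained entry in a mapping like `FIELD_MAPS`, which is exactly the kind of effort that grows with network diversity.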
Can we afford to encode domain knowledge long-term?
Current NOC solutions include rule-based handling of alarms from different
sources, such as nodes, service management systems, and element or network
management systems. The rules are written to convert domain-specific
information into an overview of the network in the NOC, and they also encode
practices for handling and correlating alarms so that they are grouped
properly. Developing these rules takes considerable time and effort.
Continuous changes in the network, with new types of network nodes and the
resulting new types of alarms, further complicate the development and
maintenance of rules. Furthermore, rules must be created and updated
frequently; otherwise, the rule base will be incomplete or even inaccurate.
Does this mean that we have stopped developing domain-oriented rules?
This does not mean that traditional rule development will disappear;
rather, domain-independent, data-driven approaches will augment it. In
addition, automatic detection of possible correlations between alarms can
enhance the rule-based approach when rules are incomplete or when
domain-specific knowledge has not yet been acquired.
The data-driven approach will help identify cross-domain correlations
and generate data-driven insights. Gradually, the system can evolve
towards a fully automated solution.
Data-driven NOC automation
We will share with you a case study on automatic incident formation, root
cause analysis, and self-healing scenarios that we are working on as part of
our research.
We apply the principles of Machine Intelligence (data mining and data
science) to discover patterns of behavior in large historical data sets. These
patterns essentially capture correlations between alarms and their
co-occurrence. An exciting aspect of our approach is that we do not only
treat the data as time series; we also examine how to deal with the
categorical and largely symbolic information collected online and identify
latent behaviors.
This approach helps domain experts learn unfamiliar and evolving behavior
patterns in multi-technology, multi-vendor environments. These correlated,
grouped patterns allow alarms to be grouped automatically, opening the way
for automatic detection of network incidents, root-cause identification, and
automated repair.
With this approach, we can achieve intelligent grouping of alarms and
tickets with minimal manual involvement; we can reduce or completely avoid
manual rule development, automatically identify large groups that would
otherwise be missed, and reduce the total number of incident tickets.
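As a minimal sketch of how learned correlations could drive alarm grouping, the example below merges alarms into one incident when they fall within a time window and their types form a known correlated pair, using a union-find structure to track the groups. This is an illustration under stated assumptions, not our actual pipeline: the `(timestamp, node, alarm_type)` schema, the `correlated` set (which would come from mining historical data), and `window_s` are hypothetical.

```python
def group_into_incidents(alarms, correlated, window_s=60):
    """Group alarms into incidents using known correlated alarm-type pairs.

    alarms: list of (timestamp_seconds, node, alarm_type) tuples (hypothetical schema).
    correlated: set of frozenset({type_a, type_b}) pairs, e.g. mined from history.
    Returns a list of incidents, each a list of alarms.
    """
    alarms = sorted(alarms)
    parent = list(range(len(alarms)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (t_i, _, type_i) in enumerate(alarms):
        for j in range(i + 1, len(alarms)):
            t_j, _, type_j = alarms[j]
            if t_j - t_i > window_s:
                break  # alarms are sorted, so later ones are even further away
            if frozenset((type_i, type_j)) in correlated:
                parent[find(j)] = find(i)

    incidents = {}
    for i, alarm in enumerate(alarms):
        incidents.setdefault(find(i), []).append(alarm)
    return list(incidents.values())


# Illustrative, made-up data: four alarms collapse into three incidents.
alarms = [
    (0, "n1", "LINK_DOWN"), (5, "n2", "BGP_DROP"),
    (10, "n3", "FAN_FAIL"), (500, "n1", "LINK_DOWN"),
]
correlated = {frozenset({"LINK_DOWN", "BGP_DROP"})}
incidents = group_into_incidents(alarms, correlated)
```

Each resulting group would map to a single incident ticket rather than one ticket per alarm, which is how this kind of grouping lowers the total ticket count.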