Negative Downtime – A Positive for Customer Experience
Author: Sadia Ahmad
April 27, 2020
Mobile network evolution is at the forefront of technological advancement. The need for high-capacity, low-latency networks poses significant challenges for mobile operators, especially in newly deployed 5G networks. An operator's network must deliver an excellent customer experience (CEX) for its subscribers while also meeting the increasingly stringent requirements of use cases such as Ultra-Reliable Low-Latency Communication (URLLC). Intelligent network operation technologies, such as AI-based anomaly detection and diagnostic mechanisms, are essential to realizing automated, proactive and cost-effective operation of these networks.
In this area of network operations, it is interesting to consider the concept of negative downtime. Negative downtime is a form of preventive maintenance in which a fault is fixed before the customer notices any impact on their network experience. From a network operations standpoint, negative downtime is about managing the network proactively and heading off disruptions, thereby providing a reliable communications infrastructure and an excellent customer experience.
One of the first significant references to negative downtime comes from the heavy vehicle company Caterpillar. To minimize customer downtime and get ahead in customer service, the company focused heavily on customer-centric policies.
This process is called ‘negative’ because of its proactive nature where an issue is resolved before the customer notices any service disruption. Caterpillar realized the negative downtime concept using four main principles:
- Having redundancy for critical components
- Employing advanced diagnostics for constant system monitoring
- Having network-wide connectivity to a central hub and auto-notification on errors
- Having service personnel available for hardware issues
Could we apply some or all of these principles to the operation of a mobile network to realize the concept of negative downtime? In this scenario, negative downtime would mean that the onset of performance degradation, hardware failures or cell outages is detected automatically, and preventative action is taken before there is any impact on customer experience or on the reliability of communication. This would be of significant benefit to operators from both a network operations and a customer satisfaction perspective.
At Aspire Technology we are building an Intelligent Operation (iOps) framework, the goal of which is to develop solutions that introduce AI and automation into operations to assist and manage customer networks. A significant part of the iOps framework is the application of machine learning (ML) for early detection of anomalies in KPI behavior. This early detection of issues is one of the key milestones in realizing the principle of negative downtime.
Our ML-based KPI anomaly detection system uses historical KPI data to learn the normal performance behavior of different network entities, e.g., cells, sites or clusters. Once the algorithms have been trained and have learned this normal behavior, the current real-time behavior is compared to what is considered normal, and any deviations or anomalies are automatically detected and given a severity level for follow-up actions. This ML-based solution is superior to more traditional threshold-based detection methods because the baseline KPI values for triggering an anomaly are based on the actual historical performance behavior of the cell rather than on a fixed threshold. This allows for more accurate detection of deviations in performance, right down to cell level.
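To make the contrast with fixed thresholds concrete, here is a minimal sketch of a learned per-cell baseline. It is not the iOps implementation: the hour-of-day grouping, the z-score, the severity cut-offs, the sample throughput figures and every function name are assumptions chosen for illustration, and a production system would use a richer model.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_baseline(history):
    """Learn (mean, std) of a KPI per hour of day for one cell.

    history: iterable of (hour_of_day, kpi_value) pairs (hypothetical format).
    """
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples, so skip hours with too little data.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def deviation_score(baseline, hour, value):
    """Distance from the learned normal, in standard deviations."""
    mu, sigma = baseline[hour]
    if sigma == 0:
        return 0.0 if value == mu else float("inf")
    return abs(value - mu) / sigma


def severity(score):
    """Map a deviation score to a severity level (illustrative cut-offs)."""
    if score >= 5:
        return "critical"
    if score >= 3:
        return "major"
    if score >= 2:
        return "minor"
    return "normal"


# Hypothetical downlink throughput (Mbps) for one cell at 03:00 and 18:00.
history = [(3, v) for v in (10, 12, 11, 9, 13)] + \
          [(18, v) for v in (95, 100, 105, 98, 102)]
baseline = learn_baseline(history)

FIXED_THRESHOLD = 20  # a static "alarm if below 20 Mbps" rule, for comparison

for hour, value in [(3, 11), (18, 40)]:
    learned = severity(deviation_score(baseline, hour, value))
    fixed = "alarm" if value < FIXED_THRESHOLD else "ok"
    print(f"{hour:02d}:00 value={value} learned={learned} fixed={fixed}")
```

In this toy data the fixed rule raises an alarm on the quiet 03:00 reading, which is normal for that hour, and stays silent on the 18:00 drop to 40 Mbps, which is far outside the cell's usual busy-hour behavior. The learned baseline gets both cases the right way round.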
As we develop our AI-based systems further, the goal is to capture patterns or signatures of the network around the time when the anomaly was detected. These patterns can then be continuously scanned for in the network to preempt the occurrence of an anomaly and take preventative actions before it becomes a CEX impacting issue – thus realizing the concept of negative downtime. As more and more patterns are observed by the system, it continuously learns an increasing number of fault scenarios for which it can take preventative actions.
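One simple way to picture scanning for a stored signature is a sliding-window comparison against the KPI values that preceded a past anomaly. The sketch below is an assumption-laden illustration, not the method described above: the z-normalization, the root-mean-square distance, the `max_distance` cut-off and the sample data are all hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def normalise(window):
    """Z-normalise a window so that shape matters, not absolute level."""
    mu, sigma = mean(window), pstdev(window)
    if sigma == 0:
        return [0.0] * len(window)
    return [(x - mu) / sigma for x in window]


def distance(a, b):
    """Root-mean-square difference between two equal-length windows."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def find_matches(series, signature, max_distance=0.3):
    """Return start indices where the series resembles the signature."""
    n = len(signature)
    target = normalise(signature)
    hits = []
    for start in range(len(series) - n + 1):
        window = normalise(series[start:start + n])
        if distance(window, target) <= max_distance:
            hits.append(start)
    return hits


# Hypothetical signature: an accelerating KPI decline seen before a past fault.
signature = [100, 98, 95, 90, 82]

# Hypothetical live KPI samples; the same shape reappears at a lower level.
live = [100, 101, 99, 100, 50, 49, 47.5, 45, 41, 100]

hits = find_matches(live, signature)
print("signature seen at sample offsets:", hits)
```

Because each window is normalized before comparison, the same decline is recognized on a cell that runs at half the load. A match would then trigger a preventative action rather than wait for the anomaly itself.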
Despite the advantages of working towards the negative downtime philosophy, there are a few hurdles to overcome. The first is that the current reporting capabilities of network equipment may not be sufficient to truly realize the concept of negative downtime. For example, in KPI monitoring, a 15-minute reporting granularity for performance data is most likely too coarse to allow issues to be caught before customers notice them. The second is the need for a quick response to hardware malfunctions. It may be possible to detect an issue before it becomes customer impacting, but if the fix requires personnel to go to site and replace hardware, the benefits of early detection may be lost. Solving this problem may require more redundancy in the network equipment.
The road to realizing the negative downtime concept in mobile networks is not straightforward, and there are many challenges. Automation and AI/ML techniques form a key part of the solution, enabling the early detection and prevention of issues at network scale and helping to deliver a real positive for customer experience.
Aspire Technology, through its innovative culture, is at the forefront of employing the latest technological solutions and provides a tactical edge in a highly competitive market. Aspire's data science and domain experts are available to help you define your optimum CEX solution and introduce the principle of negative downtime as an element in the management and operations of your network.
If you would like to know more, I am happy to answer any of your questions. Just drop me an email at email@example.com.