Network Fault Management: How Can Telcos Make Connectivity Issues A Thing Of The Past!
How many times have you experienced call drops? Annoying, right? The key challenge in the telecom industry is managing network faults, as it directly impacts the customer attrition rate. If a telecom provider does not offer a good and reliable network, customers will most likely port to other service providers.
Why is network fault management so critical for telcos?
What is a network fault? In telecom, a network fault occurs when a failure in a network element, such as a failed port, slows the network or causes intermittent loss of connectivity between hosts, resulting in a traffic drop. One of the key challenges telcos face is that a fault in a single network device can leave the entire network suffering a service outage, congestion, service degradation, traffic drop, and long downtime.
There are various types of network faults. A fault can stem from interface or port flapping in the network, high utilization or CRC errors on a link, or any other warning or alarm generated by a network device.
What is the solution to such network fault issues, and how can telcos overcome them efficiently? The answer is simple! Network automation has helped several companies avoid network faults, save costs, reduce downtime, and offer better customer experiences. When a network is automated, it can detect a potential fault, identify the source of the problem, and resolve it with little or no manual intervention. An automated network has a set of functions that detect and isolate malfunctions and take appropriate actions to correct them.
A network management tool has different building blocks that are interconnected to provide the best solution.
4 key steps in the network fault management workflow
Step 1: Identify the originator or source of the fault
The fault is captured by the EMS/NMS systems that manage the respective network devices. For example, Cisco devices have separate EMS systems. Similarly, Nokia, Huawei, and Juniper have their own NMS/EMS systems. The EMS system should capture all types of warnings and alarms generated by devices and convert them to either JSON or XML format, which becomes the payload for the alarm collection module.
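The conversion step could look something like the sketch below. It is only an illustration: the field names in `FIELD_ALIASES`, the sample alarm, and the `vendor-a` label are hypothetical, and real EMS/NMS exports differ per vendor.

```python
import json

# Hypothetical aliases: each common field lists names a vendor export might use.
FIELD_ALIASES = {
    "device": ("device", "hostname", "neName"),
    "interface": ("interface", "ifName", "port"),
    "fault_type": ("fault_type", "alarmType", "eventType"),
    "severity": ("severity", "perceivedSeverity"),
    "raised_at": ("raised_at", "timestamp", "eventTime"),
}


def normalize_alarm(raw: dict, vendor: str) -> str:
    """Map a vendor-specific alarm record onto one common JSON payload."""
    payload = {"vendor": vendor}
    for field, aliases in FIELD_ALIASES.items():
        # Take the first alias present in the raw record, else None.
        payload[field] = next((raw[a] for a in aliases if a in raw), None)
    return json.dumps(payload, sort_keys=True)


# Illustrative alarm as one vendor's EMS might export it (made-up values).
example = normalize_alarm(
    {
        "hostname": "pe1",
        "ifName": "Gi0/0/1",
        "alarmType": "LINK_FLAP",
        "perceivedSeverity": "major",
        "eventTime": "2024-01-01T10:00:00Z",
    },
    vendor="vendor-a",
)
```

Mapping every vendor onto one schema at this stage means the later steps never need to know which EMS an alarm came from.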
Step 2: Collection of all such events by the alarm collection module
The alarm collection module is a service-level management system that provides real-time, centralized monitoring of network events. It communicates with the various EMS/NMS systems where faults have been captured, collects the generated events in JSON or XML format, adds some additional parameters to them, and publishes them to a Queue Management System.
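As a rough sketch of this step, the function below parses a JSON event, adds a few parameters, and hands it to a publisher. The added fields (`event_id`, `source_ems`, `collected_at`) and the `publish` callable are assumptions made for illustration; the article does not say which parameters are added, and XML input is not handled here.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Callable


def collect_event(raw_json: str, source_ems: str,
                  publish: Callable[[str], None]) -> dict:
    """Enrich one JSON event and pass it to the queue publisher."""
    event = json.loads(raw_json)
    # Hypothetical enrichment fields added by the collector.
    event["event_id"] = str(uuid.uuid4())
    event["source_ems"] = source_ems
    event["collected_at"] = datetime.now(timezone.utc).isoformat()
    publish(json.dumps(event))
    return event


# A plain list stands in for the Queue Management System in this sketch.
outbox: list = []
enriched = collect_event(
    '{"device": "pe1", "fault_type": "LINK_FLAP"}', "ems-1", outbox.append
)
```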
Step 3: Managing multiple Queues
Based on event types or network sources, there could be multiple Queues. Each Queue carries messages for a specific network type. For example, the MPLS network and the ISP network have separate Queues to carry these messages. The Queues enable asynchronous communication between the endpoints that produce and consume these messages, and they keep data persistent when either the consumer or the producer system goes offline. The producer can send messages without waiting for them to be processed, and the consumer consumes them only when it is available.
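The decoupling between producer and consumer can be sketched with Python's standard `queue` module. This is only a stand-in: an in-memory queue is not persistent, so a real deployment would use a message broker, and the `network_type` field and queue names are assumptions.

```python
import queue

# One queue per network type (names are illustrative).
QUEUES = {"mpls": queue.Queue(), "isp": queue.Queue()}


def publish(event: dict) -> None:
    """Producer side: enqueue and return without waiting for a consumer."""
    QUEUES[event["network_type"]].put(event)


def drain(network_type: str) -> list:
    """Consumer side: take whatever is waiting when the consumer is ready."""
    q = QUEUES[network_type]
    events = []
    while True:
        try:
            events.append(q.get_nowait())
        except queue.Empty:
            return events


# The producer publishes three events while no consumer is running.
publish({"network_type": "mpls", "id": 1})
publish({"network_type": "isp", "id": 2})
publish({"network_type": "mpls", "id": 3})

# Later, the MPLS consumer picks up only its own messages, in order.
mpls_events = drain("mpls")
```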
Step 4: Ticket generation by the Event Manager
In the fourth and final step, an Event Manager tool generates tickets for network events. Once an issue has been identified, the Event Manager interacts with the remediation workflow and takes the necessary action on devices.
These alerts are turned into tickets on which the backend engineering teams work. A ticket must carry all relevant information, such as the time when the ticket was created, the source of the problem, the type of problem, the event generation time at the network device, the number of occurrences of the problem, and the history of the problem. This lets users look up the same type of issue received earlier and the remediation taken on that ticket.
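A ticket record carrying the fields listed above might be modeled as in the sketch below. The class name, field names, and sample values are hypothetical and chosen only to mirror the list in the text.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Ticket:
    source: str          # source of the problem, e.g. device and interface
    fault_type: str      # type of problem
    event_time: str      # event generation time at the network device
    created_at: str = field(default_factory=_now)  # ticket creation time
    occurrences: int = 1                            # number of occurrences
    history: list = field(default_factory=list)    # earlier remediation notes

    def record(self, note: str) -> None:
        """Append a remediation note so later users can see what was done."""
        self.history.append(note)


ticket = Ticket(
    source="pe1:Gi0/0/1",
    fault_type="LINK_FLAP",
    event_time="2024-01-01T10:00:00Z",
)
ticket.record("raised link cost")
```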
An example
For example, if there is an interface flap, the ticket should provide information such as how long the event lasted and when the interface became stable again after flapping. If an alarm is generated for a device (A-end), a similar alarm must have been received from the device at the other end (Z-end). These two alarms should be merged into a single ticket to avoid duplicate tickets.
Also, the Event Manager should not create a new ticket if an alarm originates from the same network device with the same interface and type of fault. In this case, related alarms or events should be accumulated into a single ticket. A graphical representation should also show the timeframe of the issues and the total number of alarms generated, which gives the user a clear understanding of the network events generated at devices.
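One way to sketch this merging is to key tickets on the link (both endpoints, in either order) plus the fault type. The sketch assumes each alarm already carries its peer device and interface, which in practice would come from topology data; all field names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinkTicket:
    link: frozenset      # {(device, interface), (peer device, peer interface)}
    fault_type: str
    first_seen: float
    last_seen: float
    occurrences: int = 0


def correlate(alarms: list) -> list:
    """Merge A-end, Z-end, and repeated alarms into one ticket per link and fault."""
    tickets = {}
    for a in alarms:
        # A frozenset makes the A-end and Z-end views of a link compare equal.
        link = frozenset({
            (a["device"], a["interface"]),
            (a["peer_device"], a["peer_interface"]),
        })
        key = (link, a["fault_type"])
        t = tickets.get(key)
        if t is None:
            t = tickets[key] = LinkTicket(link, a["fault_type"], a["time"], a["time"])
        t.occurrences += 1
        t.first_seen = min(t.first_seen, a["time"])
        t.last_seen = max(t.last_seen, a["time"])
    return list(tickets.values())


alarms = [
    {"device": "pe1", "interface": "Gi0/0/1", "peer_device": "pe2",
     "peer_interface": "Gi0/0/2", "fault_type": "LINK_FLAP", "time": 100.0},
    {"device": "pe2", "interface": "Gi0/0/2", "peer_device": "pe1",
     "peer_interface": "Gi0/0/1", "fault_type": "LINK_FLAP", "time": 101.0},
    {"device": "pe1", "interface": "Gi0/0/1", "peer_device": "pe2",
     "peer_interface": "Gi0/0/2", "fault_type": "LINK_FLAP", "time": 160.0},
]
tickets = correlate(alarms)
```

The `first_seen` and `last_seen` values give the timeframe of the issue, and `occurrences` gives the alarm count that a graphical view could plot.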
The next question is what action or remediation must be taken on the ticket once it is created. This depends on the type of issue detected. For example, if it is an interface flap, the link's cost can be increased. In the case of a bundle, the affected link interface can be removed from the bundle interface. Before applying anything, the user must validate the dry run against the device's current configuration to check whether the proposed remediation already exists on the device. The Event Manager should provide this information by communicating with the respective network device.
Once the dry run is generated, the user must validate the network with pre- and post-health checks. For the pre-health check, the user validates the network by executing the respective commands on the network device before the remediation. For the post-health check, the user validates the network by executing commands on the network devices after the remediation.
Once the pre-health check has passed, the user can fire the commit request to the respective device. After that, the user should run the post-health check commands and check the request's status to confirm that the configuration was pushed successfully to the devices. Once all of the above steps are validated, the user can close the ticket.
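The sequence described above (dry run, pre-health check, commit, post-health check, close) can be sketched as a small function. The callables, the status strings, and the fake in-memory device are all hypothetical stand-ins; a real workflow would talk to the device through whatever interface the Event Manager uses.

```python
from typing import Callable, List


def remediate(dry_run: Callable[[], List[str]],
              pre_check: Callable[[], bool],
              commit: Callable[[List[str]], bool],
              post_check: Callable[[], bool]) -> str:
    """Run the remediation steps in order and report where the flow stopped."""
    changes = dry_run()
    if not changes:
        return "no-change-needed"   # remediation already exists on the device
    if not pre_check():
        return "pre-check-failed"
    if not commit(changes):
        return "commit-failed"
    if not post_check():
        return "post-check-failed"
    return "ticket-closed"


# Fake device state for illustration: raise the link cost on a flapping link.
config = {"link_cost": 10}
TARGET_COST = 1000


def dry_run() -> List[str]:
    # An empty list means the device already has the proposed configuration.
    if config["link_cost"] == TARGET_COST:
        return []
    return [f"set link cost {TARGET_COST}"]


def commit(changes: List[str]) -> bool:
    config["link_cost"] = TARGET_COST
    return True


def healthy() -> bool:
    return True  # stand-in for running health-check commands


def cost_applied() -> bool:
    return config["link_cost"] == TARGET_COST


first = remediate(dry_run, healthy, commit, cost_applied)
second = remediate(dry_run, healthy, commit, cost_applied)
```

Running the flow a second time returns early because the dry run finds nothing left to change, which is the situation the dry-run validation is meant to catch.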
These are the crucial steps to handle a ticket in case of a network fault.
Top network fault management tool use cases
- Service degradation due to link flapping
When there is no link failure, but the link is flapping (toggling up and down), services will degrade, and traffic will drop.
- Service degradation due to interface errors (Low RX and CRC)
Low-RX and Cyclic Redundancy Check (CRC) errors on a link can degrade throughput quality and cause unexpected network drops.
- Service degradation due to link errors in some of the bundle links
It is common to have bundle links made up of multiple 10-gig, 100-gig, or 400-gig links. In this case, traffic loads are balanced between member links. However, interface errors on some of the member links will cause a traffic drop.
- High utilization/congestion performance
Between the source and the destination, there are multiple ECMP (Equal-Cost Multipath) paths. However, in the event of a link failure, if the remaining links are over-utilized, it will cause traffic drops and service degradation.
- Silent drops due to H/W and software errors
Silent drops occur in the network due to potential software or hardware issues or bugs. Such drops cause packet loss without the usually known events like protocol flaps, interface flaps, or other known triggers.
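To make the ECMP congestion case concrete, the sketch below works out link utilization after a failure. It assumes traffic is split evenly across the surviving equal-capacity links, which is a simplification, and the traffic figures are made up for illustration.

```python
def utilization_after_failure(total_gbps: float, link_capacity_gbps: float,
                              links: int, failed: int) -> float:
    """Per-link utilization (1.0 = 100%) once `failed` links are down."""
    remaining = links - failed
    if remaining <= 0:
        raise ValueError("no links left to carry traffic")
    # Even split across the surviving links (simplifying assumption).
    return total_gbps / (remaining * link_capacity_gbps)


# Four 100-gig ECMP paths carrying 280 Gbps in total.
normal = utilization_after_failure(280, 100, links=4, failed=0)
one_down = utilization_after_failure(280, 100, links=4, failed=1)
two_down = utilization_after_failure(280, 100, links=4, failed=2)
overloaded = two_down > 1.0  # demand exceeds capacity, so traffic is dropped
```

In this example the paths run at 70% when all are up and at about 93% with one link down; with two links down, demand is 140% of the remaining capacity.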
Conclusion
A network fault management tool has numerous advantages. Manual efforts to detect and resolve an issue can be error-prone and time-consuming, and they require significant manpower to identify issues in the network and resolve them. A network fault management tool helps save costs, eliminate repetitive work, resolve issues without much delay, and deliver superior customer experiences.