As data rates climb toward 100G, troubleshooting network issues is becoming increasingly mission-critical. To identify and remediate service-impacting issues and lower MTTR (Mean Time To Resolution), ITOps teams need to monitor a wide range of metrics and data sources, including packet data, in real time.
Network infrastructure troubleshooting is a multi-layered process that runs from the vague “something’s wrong” to the root cause analysis of a specific problem. The more disciplined the process, and the better the understanding of how network behavior correlates with the issues affecting end users, the faster problems can be resolved or handed off to the appropriate teams for remediation.
The Big 3 Network and User Experience Issues
The perennial challenge with this process is that user complaints are usually vague. Users (whether employees, customers, or even algorithms that are sensitive to networking conditions) typically report one of three things: “I can’t connect,” “the network is too slow,” or “my voice/video call quality is bad.”
Since each of these can be caused by multiple underlying issues, IT teams often struggle to narrow things down. For example, a slow network could be caused by network, application, or protocol latency, each of which might show itself through any one of a number of different metrics. But to the frustrated end-user, it all looks the same – and much can be lost in translation.
To find the root cause and speed up issue resolution, IT teams need not only the right tools for assessing network metrics but also a clear view of the correlation between user experience, measurable network behavior, and underlying network issues. To illustrate, let’s walk through the troubleshooting process.
Collect The Relevant Metrics: Step One
Organizations rely on many sources and types of network data to provide context for end-user complaints. The fundamental need is to set up the network monitoring infrastructure so that IT has access to packet data, flow data, event and telemetry data, and server KPIs. Together, these sources provide the insight needed to identify the root cause in a range of scenarios.
Particular metrics are relevant to specific issues. For “the network is slow,” the relevant metrics are one-way latency, round-trip time, TCP zero window (Z-Win) events, DNS or HTTP latency, throughput (Gbps), packets per second (PPS), connections per second (CPS), and concurrent connections (CC). For “quality is poor,” look at jitter, sequence errors, retransmissions, and fragmentation. When “connectivity” is the problem, examine ICMP, HTTP, and SYN/SYN-ACK errors.
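The mapping in the paragraph above can be kept as a small lookup table so that a complaint category leads straight to the metrics worth pulling first. This is only a sketch: the dictionary keys and the `metrics_for` helper are hypothetical names chosen for illustration, not part of any monitoring product.

```python
from __future__ import annotations

# Complaint-to-metric lookup, transcribed from the paragraph above.
# The category keys and function name are illustrative assumptions.
METRICS_BY_COMPLAINT = {
    "slow": [
        "one-way latency", "round-trip time", "Z-Win",
        "DNS latency", "HTTP latency",
        "throughput (Gbps)", "packets per second (PPS)",
        "connections per second (CPS)", "concurrent connections (CC)",
    ],
    "poor quality": [
        "jitter", "sequence errors", "retransmissions", "fragmentation",
    ],
    "connectivity": ["ICMP errors", "HTTP errors", "SYN/ACK errors"],
}


def metrics_for(complaint: str) -> list[str]:
    """Return the metrics to examine first for a complaint category."""
    key = complaint.strip().lower()
    if key not in METRICS_BY_COMPLAINT:
        raise ValueError(f"unknown complaint category: {complaint!r}")
    return METRICS_BY_COMPLAINT[key]
```

Calling `metrics_for("poor quality")` returns the four quality metrics, while an unrecognized category raises an error rather than silently returning nothing.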
Narrow Down The Issue: Step Two
Once IT teams have access to the data they need, they can begin correlating various network behaviors to rule out possible causes and zero in on the actual issue. This varies based on which complaint they are troubleshooting.
- Slow Network – This is most likely caused by network overload, but it’s also possible that a server is too busy or the DNS server isn’t responding. As discussed, the relevant metrics are one-way latency (for network issues), round-trip time or Z-Win (for application issues), and DNS or HTTP latency (for protocol issues). If network latency is high, then either the overall amount of traffic on the network is too high or the traffic is “bursty.” Looking at overall performance and at throughput (Gbps), packets per second (PPS), connections per second (CPS), and concurrent connections (CC) should help determine which it is. If application or protocol latency is the cause, the issue can be handed off to the appropriate team to resolve. Looking at both packet and flow data is especially important when troubleshooting a slow network. Flow data can identify top talkers or packets per second, but it can’t show how bursty the network has been or the number of connections per second; packet data is required for that.
- Poor Quality – IT should monitor jitter, sequence errors, retransmissions, and fragmentation to diagnose these complaints. High rates of jitter and sequence errors suggest the issue is with network streaming, while retransmissions and fragmentation indicate that the problem is packet loss. These could be caused by routing problems or by MTU (Maximum Transmission Unit) misconfigurations that lead to fragmentation.
- Connectivity – This complaint could be caused by a problem with authentication or authorization, or by a mistake in a piece of equipment’s Access Control List. To figure out which one it is, IT teams should first look at protocol errors for the device in question. Next, they should examine connection errors, for example by checking packet data for SYN/SYN-ACK errors to make sure the TCP three-way handshake between the client and the server completes.
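Two of the packet-level checks described in the list above can be sketched in a few lines. The first measures burstiness as the ratio between the busiest time window and the average window, which is something an averaged flow record cannot show. The second walks captured TCP flags to find connections whose three-way handshake never completed. Both functions, their names, the 1 ms default window, and the simplified input formats (integer microsecond timestamps and `(connection_id, flag)` pairs) are assumptions made for illustration; a real capture would need to be parsed into these shapes first.

```python
from collections import Counter


def peak_to_mean_ratio(timestamps_us, window_us=1000):
    """Packet count in the busiest window divided by the mean count per window.

    timestamps_us: packet arrival times in integer microseconds.
    window_us: window width in microseconds (1 ms is an assumed default).
    A value near 1.0 means smooth traffic; larger values mean burstier traffic.
    """
    if not timestamps_us:
        raise ValueError("no packets supplied")
    start = min(timestamps_us)
    counts = Counter((t - start) // window_us for t in timestamps_us)
    n_windows = max(counts) + 1  # count empty windows toward the mean
    mean = len(timestamps_us) / n_windows
    return max(counts.values()) / mean


def incomplete_handshakes(events):
    """Map each connection to the first handshake step that was never seen.

    events: (connection_id, flag) pairs in capture order, where flag is
    "SYN", "SYN-ACK", or "ACK". Connections that completed all three
    steps are omitted from the result.
    """
    expected = ("SYN", "SYN-ACK", "ACK")
    progress = {}
    for conn, flag in events:
        step = progress.setdefault(conn, 0)
        if step < len(expected) and flag == expected[step]:
            progress[conn] = step + 1
    return {
        conn: expected[step]
        for conn, step in progress.items()
        if step < len(expected)
    }
```

As a usage example, ten packets spaced evenly over 10 ms and ten packets where nine arrive within the first millisecond have the same average rate, yet `peak_to_mean_ratio` gives 1.0 for the first and 9.0 for the second. For the handshake check, a connection that shows only a SYN is reported as missing its SYN-ACK, which points toward the server side or a device in the path dropping the request.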
Identify Root Cause: Step Three
By this point, IT should have found the root cause of the issue and can move to remediation. The issue is frequently a network configuration error, but other possibilities include network equipment failures, application errors or bugs, a DDoS attack, or other security incidents. Without access to a wide range of network metrics and packet data, however, IT will be left guessing which issue is really at play.