3 Key Metrics for Efficient Performance Monitoring
Today’s networks have grown more complex to manage. With so many network monitoring tools on the market, choosing the right ones can be challenging. The growing complexity of public, private, and cloud environments means that network admins need access to critical information about the health and performance of their network. Keeping track of valuable metrics is important for troubleshooting network problems and for improving the end user experience. When network admins are equipped with the right tools and metrics, they have the end-to-end visibility they need to optimize network performance.
One of the most effective solutions for network monitoring is a network performance monitoring and diagnostics (NPMD) tool. In short, NPMD tools detect, identify, and help prevent issues affecting the application traffic that passes through the Internet and the networking devices and appliances that make up the physical infrastructure. The goal of NPMD tools is to reduce outages, provide troubleshooting information when incidents occur, and optimize performance.
It’s all about the metrics
There are many metrics network admins track daily, and all of them provide relevant information about network performance. We’ve selected the top 3 metrics we think every network admin should monitor:
1. Link Utilization
2. TCP Retransmissions
3. Network and Server Response Time
Link Utilization:
Have you ever experienced high latency and dropped packets? We’ve all been victims of slow networks and packet loss, so link utilization (also called bandwidth occupancy) is one key metric to keep your eye on. It’s important to know what’s happening on your network links in case they become congested. When they do, you will notice an increase in packet drops at the routers. Also make sure the bandwidth occupancy you’re measuring is shown with millisecond granularity. Critical network issues such as buffer oversubscription can occur within milliseconds, so if your current tools only show per-second averages, you’re not seeing the complete picture. Catching a critical network issue even a few milliseconds earlier can translate to improved customer experience, better business outcomes, and greater revenue.
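To see why granularity matters, here is a minimal Python sketch that computes utilization for the same traffic at two bucket sizes. The `utilization` function, the packet record layout (integer microsecond timestamp, size in bytes), and the 100 Mbit/s link are all assumptions made for illustration, not the output format of any particular NPMD tool.

```python
from collections import defaultdict


def utilization(packets, link_bps, bucket_us):
    """Return {bucket_start_us: fraction of link capacity used}.

    packets: iterable of (timestamp_us, size_bytes), integer microseconds.
    """
    bits_per_bucket = defaultdict(int)
    for timestamp_us, size_bytes in packets:
        bits_per_bucket[timestamp_us // bucket_us] += size_bytes * 8
    # Number of bits the link can carry during one bucket.
    capacity_bits = link_bps * bucket_us / 1_000_000
    return {
        bucket * bucket_us: bits / capacity_bits
        for bucket, bits in sorted(bits_per_bucket.items())
    }


# Hypothetical sample: a 100 Mbit/s link and a burst of ten 1250-byte
# packets that all arrive within the same millisecond.
LINK_BPS = 100_000_000
burst = [(500_000 + i * 100, 1250) for i in range(10)]

per_second = utilization(burst, LINK_BPS, 1_000_000)
per_millisecond = utilization(burst, LINK_BPS, 1_000)

print(per_second)       # the burst averages out to 0.1% of the second
print(per_millisecond)  # the same burst fills 100% of one millisecond
```

In this made-up sample the per-second view reports a nearly idle link, while the per-millisecond view shows one interval running at full capacity, which is the kind of short burst that can oversubscribe a buffer.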
You should also consider which IP addresses account for most of the traffic in order to isolate any sources of network congestion and/or anomalous behavior. Typically, a view of the network’s top talkers will identify the hosts and data flows that represent most of the bandwidth usage. Figure 1 below shows a dashboard measuring bandwidth and top talkers.
Figure 1: Example dashboard showing bandwidth and top talkers
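A top talkers list boils down to summing bytes per source and ranking the result. The sketch below assumes packet records of the form (source IP, size in bytes); the `top_talkers` function and the sample addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical and only illustrate the idea.

```python
from collections import Counter


def top_talkers(packets, n=3):
    """Return [(src_ip, bytes_sent, share_of_total)] for the n busiest sources.

    packets: iterable of (src_ip, size_bytes).
    """
    bytes_by_ip = Counter()
    for src_ip, size_bytes in packets:
        bytes_by_ip[src_ip] += size_bytes
    total = sum(bytes_by_ip.values())
    if total == 0:
        return []  # no traffic, so no talkers to rank
    return [
        (ip, sent, sent / total)
        for ip, sent in bytes_by_ip.most_common(n)
    ]


# Hypothetical sample traffic.
sample = [
    ("192.0.2.10", 6000),
    ("192.0.2.20", 1500),
    ("192.0.2.10", 1500),
    ("192.0.2.30", 1000),
]

for ip, sent, share in top_talkers(sample, n=2):
    print(f"{ip}: {sent} bytes ({share:.0%} of traffic)")
```

The same grouping could be keyed on a full flow tuple (source, destination, ports, protocol) instead of the source address when you need to isolate a specific data flow.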
TCP Retransmissions:
Transmission Control Protocol (TCP) retransmission rate is a good indicator of network health and one of the most valuable metrics to measure. By definition, retransmission is the resending of packets that have been damaged or lost. Packet retransmissions are a normal function of TCP networks. They occur when the sending node does not receive an acknowledgment for a packet from the receiving node. While some retransmissions are expected on a normal network, a sustained increase in retransmissions warrants further investigation. For example, there could be a saturated network link and/or some segment of the network that is causing packets to drop. If left unresolved, a large number of retransmissions will negatively impact the performance of your applications. Figure 2 below shows a dashboard with retransmission and response time metrics.
Figure 2: Example dashboard showing retransmission and response time metrics
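As a rough illustration of how a retransmission rate can be derived, the Python sketch below counts a data segment as a retransmission when the same flow has already sent a segment with the same sequence number. This is a simplified heuristic: it ignores partial overlaps, keepalives, and sequence number wraparound, all of which a real analyzer has to handle. The function names, the record layout (flow id, sequence number, payload length), and the 2% threshold are assumptions chosen for the example.

```python
def retransmission_rate(segments):
    """Fraction of data segments that repeat an already-seen sequence number.

    segments: iterable of (flow_id, seq, payload_len).
    """
    seen = set()
    data_segments = 0
    retransmitted = 0
    for flow_id, seq, payload_len in segments:
        if payload_len == 0:
            continue  # pure ACKs carry no data, so skip them
        data_segments += 1
        key = (flow_id, seq)
        if key in seen:
            retransmitted += 1
        else:
            seen.add(key)
    return retransmitted / data_segments if data_segments else 0.0


def sustained_increase(rates, threshold, min_intervals):
    """True if the rate stays above threshold for min_intervals in a row."""
    run = 0
    for rate in rates:
        run = run + 1 if rate > threshold else 0
        if run >= min_intervals:
            return True
    return False


# Hypothetical capture: the segment at seq 1001 is sent twice.
capture = [
    ("A", 1, 1000),
    ("A", 1001, 1000),
    ("A", 2001, 1000),
    ("A", 1001, 1000),  # retransmission
    ("A", 3001, 1000),
    ("A", 4001, 0),     # pure ACK, ignored
]
print(retransmission_rate(capture))

# Hypothetical per-interval rates checked against an assumed 2% threshold.
print(sustained_increase([0.01, 0.05, 0.06, 0.07], 0.02, 3))
```

Checking for several consecutive intervals above the threshold, rather than a single spike, reflects the point that occasional retransmissions are expected and only a sustained increase calls for investigation.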
Network and Server Response Time:
Many applications today are based on TCP and a client/server model. The application response time metric measures the time it takes for a server to respond to a data request with application data. This metric tells us how quickly the application is responding to requests. If the response time increases, the application is running slowly.
Network Round Trip Time (RTT) is another great indicator of overall network health, as well as the health and responsiveness of your server’s TCP/IP stack. Basically, it’s the amount of time, typically measured in milliseconds, between a client sending a packet and receiving the server’s response to it. If a server is overwhelmed with requests, such as during a DDoS attack, its ability to respond efficiently is inhibited, resulting in increased round trip time.
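One common way to separate the two measurements is to take the network RTT from the TCP handshake (SYN to SYN-ACK) and then subtract it from the time between a request and the first byte of the response. The Python sketch below assumes timestamps recorded on the client side in milliseconds; the `ConnectionTimes` fields and function names are hypothetical, and the subtraction is only an approximation of server processing time.

```python
from dataclasses import dataclass


@dataclass
class ConnectionTimes:
    """Client-side timestamps for one connection, in milliseconds."""
    syn_ms: float
    syn_ack_ms: float
    request_ms: float
    first_response_byte_ms: float


def network_rtt_ms(conn):
    # Handshake round trip: SYN sent until SYN-ACK received.
    return conn.syn_ack_ms - conn.syn_ms


def application_response_ms(conn):
    # Time to first response byte, minus the network round trip,
    # approximates how long the server spent producing the response.
    time_to_first_byte = conn.first_response_byte_ms - conn.request_ms
    return max(0.0, time_to_first_byte - network_rtt_ms(conn))


# Hypothetical connection: 20 ms on the wire, 120 ms in the application.
conn = ConnectionTimes(
    syn_ms=0.0,
    syn_ack_ms=20.0,
    request_ms=21.0,
    first_response_byte_ms=161.0,
)
print(network_rtt_ms(conn), application_response_ms(conn))
```

With the sample values, a slow page load would be attributed mostly to the application rather than the network, which is the distinction these two metrics are meant to expose.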
Tying it All Together
NPMD tools provide great insight into the health and performance of your applications and network. The metrics mentioned above should give you a good starting point, but for meaningful analysis, you need to dig deeper and correlate network traffic metrics with individual processes.
Having all this information about the quality of your network is useful and can provide deep insights. Whichever tool you use, it’s important to understand the fundamentals of network performance metrics and how they can help you keep your network performing reliably.
Which metrics are you using to monitor your network? Let us know in the comment section below.