IT infrastructure and operations (I&O) teams, including network operations (NetOps), development operations (DevOps), security operations (SecOps), and cloud operations (CloudOps), are chartered to make their businesses more productive and competitive while managing ever-evolving, increasingly complex networks, often with the same or fewer resources. To keep up with these requirements, IT teams need the right visibility tools that allow them to be agile, efficient, and proactive.
Visibility is required for troubleshooting and root-cause identification, both of which depend heavily on the information and data available. From the network perspective, the sources of data can be divided into three classes: events, flows, and packets. As will be covered below, network packet data provides the richest and most useful information for visibility, troubleshooting, and root-cause analysis, and is the most efficient way to resolve unplanned and hard-to-find problems.
The quality of the data and the quality of measurements determine how reliable the resolution process is. If the data is incomplete or the quality is compromised, the quality of the results will not be good enough for decision making (the “garbage-in, garbage-out” principle applies to visibility and troubleshooting).
Let’s compare the three common sources and the corresponding types of data to understand the pros and cons of each.
Using Events for Network Troubleshooting
Events are generated by a multitude of entities. Events include things from SNMP traps to debug and error messages that developers add to their code or general Syslog informative log messages. The technology of collecting events, indexing, and correlating them is mature and commonly used with many tools available to collect, store, browse, correlate, and analyze event data. An advantage of events is that the data is typically logged as human readable text that is straightforward to search and interpret. Systems that are well instrumented and implemented for robust event logging provide a powerful source of information you can use to quickly understand when something is wrong or is going wrong.
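To make this concrete, here is a minimal sketch of event-log triage: scanning syslog-style lines and surfacing those at a chosen severity or worse. The log lines, timestamp format, and severity keywords are illustrative, not taken from any specific device or logging standard.

```python
import re

# Illustrative severity ladder, lowest to highest.
SEVERITY_ORDER = ["DEBUG", "INFO", "NOTICE", "WARNING", "ERROR", "CRITICAL"]

def filter_events(lines, min_severity="WARNING"):
    """Return (timestamp, severity, message) tuples at or above min_severity."""
    threshold = SEVERITY_ORDER.index(min_severity)
    # Assumed line shape: "<timestamp> <SEVERITY> <message>"
    pattern = re.compile(r"^(\S+) (\w+) (.*)$")
    results = []
    for line in lines:
        match = pattern.match(line)
        if not match:
            continue  # skip lines that don't fit the expected format
        timestamp, severity, message = match.groups()
        if severity in SEVERITY_ORDER and SEVERITY_ORDER.index(severity) >= threshold:
            results.append((timestamp, severity, message))
    return results

log = [
    "2024-05-01T10:00:01 INFO link up on port 3",
    "2024-05-01T10:00:02 ERROR BGP session to 10.0.0.2 reset",
    "2024-05-01T10:00:03 WARNING high CPU on line card 1",
]
alerts = filter_events(log)  # keeps the ERROR and WARNING lines
```

Because the text is human readable, this kind of filtering and searching is straightforward, which is exactly the strength of event data described above.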
Risks of Using Only Events for Troubleshooting
The main advantage of events and the log data they generate is also their main weakness for troubleshooting. For events to be useful, they must be planned and instrumented in advance by the application developers and system integrators. Even the most thorough design and planning can only prepare for the problems the developers anticipated. In other words, events and logs are a good source of information for the “known unknowns.”
Here’s an example where relying only on events did not facilitate efficient troubleshooting:
Two VoIP switches from different vendors were not fully compatible. Neither vendor’s development team expected the specific behavior of the other’s equipment, so neither switch logged information that helped to understand and isolate the problem. In the absence of relevant information, the problem took weeks to understand and isolate, during which time the end-user experience was suboptimal.
Using Flows for Network Troubleshooting
In network and distributed application and services troubleshooting, the second common source of information is the flow or conversation/session level data. It is quite common in enterprises with complex services that the DevOps team does not always know which entities are really part of a specific service. Flow data is one way to summarize the connectivity and the load between two endpoints in the system. Flows summarize the data observed at a specific point in the network, such as a switch, router, or a specific port. Flow data provides a summary of packets and bytes as they travel between endpoints that can be identified by layer 2 (MPLS, VLAN), layer 3 (IP addresses), layer 4 (TCP/UDP) etc.
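The roll-up described above can be sketched in a few lines: packets sharing the same 5-tuple (source IP, destination IP, protocol, source port, destination port) are aggregated into one record of packet and byte counts. This is a simplified software illustration of what a router or switch does when building flow records; the packet dictionaries below are invented sample data.

```python
from collections import defaultdict

def build_flow_table(packets):
    """Aggregate per-packet records into per-flow packet and byte counters."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        # The 5-tuple identifies the flow (conversation between endpoints).
        key = (pkt["src_ip"], pkt["dst_ip"], pkt["proto"],
               pkt["src_port"], pkt["dst_port"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["length"]
    return dict(flows)

packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "TCP",
     "src_port": 51000, "dst_port": 443, "length": 1500},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "proto": "TCP",
     "src_port": 51000, "dst_port": 443, "length": 800},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "proto": "UDP",
     "src_port": 5060, "dst_port": 5060, "length": 200},
]
table = build_flow_table(packets)  # three packets collapse into two flows
```

Note how the per-packet detail is gone after aggregation; only the counters survive. That trade-off is what makes flow data compact, and also what limits it, as discussed below.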
The good news is that mature technology exists for characterizing and collecting flow information, and many data paths support exporting flow records to devices that can consume them and provide useful insights. NetFlow is a de facto standard introduced by Cisco in its routers; over time it has evolved through nine versions. Another common format, IP Flow Information eXport (IPFIX, an IETF standard), is considered the equivalent of NetFlow version 10. Other vendors have introduced similar protocols: jFlow/cFlow from Juniper Networks, NetStream from 3Com/HP/Huawei Technologies, rFlow from Ericsson, and AppFlow from Citrix; sFlow is supported by vendors including Arista Networks, Brocade, Cisco, Dell, and AWS.
Risks of Using Only Flow Information for Troubleshooting
The flow protocols define both the formatting of the information and the transport mechanism. They were optimized for implementation in routers without taxing the device, and as such use the User Datagram Protocol (UDP) to push the data. This is one of the disadvantages of the current flow protocols: unlike TCP, UDP has no flow-control mechanism and can overwhelm the network and the receiver, causing information to be lost. Flow report data can also be lost in busy networks because the additional load can overtax the monitoring system and the collector tools.
While NetFlow offers compatibility, the use of UDP limits scalability and robustness. This problem is even more pronounced when monitoring virtual environments where it’s impossible to have a physical separation between the data plane and the monitoring plane.
Another limitation of flow data is the stateless nature of the collection that limits the ability of flow information to measure key metrics such as latency, round-trip-time, retransmission delay, etc. While flows are useful in mapping out complex services and/or identifying endpoints that push too much data, they are very limited in troubleshooting configuration issues or error conditions.
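To illustrate the gap, the sketch below estimates round-trip time from the timestamps of a TCP SYN and the matching SYN-ACK in captured packet data. A flow record, which carries only aggregate packet and byte counters, cannot answer this question at all. The timestamps and flag labels are illustrative capture data, not output from any specific tool.

```python
def handshake_rtt(capture):
    """Return the SYN -> SYN-ACK delay in milliseconds, or None if not seen.

    capture: list of (timestamp_seconds, tcp_flags) pairs in arrival order.
    """
    syn_time = None
    for ts, flags in capture:
        if flags == "SYN":
            syn_time = ts
        elif flags == "SYN-ACK" and syn_time is not None:
            return (ts - syn_time) * 1000.0
    return None

# Illustrative three-way-handshake trace.
capture = [(0.000000, "SYN"), (0.023500, "SYN-ACK"), (0.023900, "ACK")]
rtt_ms = handshake_rtt(capture)
```

Metrics like this, along with retransmission counts and server response delay, fall out naturally from packet timestamps but are invisible to stateless flow collection.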
Using Network Packet Data for Network Troubleshooting
Network packet data is absolutely the richest and therefore the best option for troubleshooting network, security, and application issues because the packets contain the ultimate source of truth. This is the only solution for the “unknown-unknown” type of problems that are often very hard to find and diagnose.
There are challenges with using network packet data, particularly the velocity and volume of data, especially for heavily used high-speed networks. Specific instrumentation is required to reliably capture, process, and analyze every single packet, so the performance of packet monitoring and capturing solutions matters. Such performance comes at a cost. But because network packet data is the best option for troubleshooting network, infrastructure, and application problems in the most precise way – the benefits outweigh the costs.
Let’s close this section by returning to the previous example: by examining the data packets exchanged between the VoIP switches, IT was able to very quickly identify the root cause of the problem and fix it. Shrinking the time to resolution from several person-weeks to less than one person-day is a huge saving, and that does not factor in the gains from maintaining optimal experiences and productivity.
Risks of Using Packets for Troubleshooting
A robust solution must be designed from the ground up to provide data that is usable (a patchwork approach will not work). Usable data must possess these characteristics:
- Complete (i.e., lossless without gaps that result in blind spots)
- Consistent and reliable
- Precise with the appropriate resolution
Humans cannot inspect high-velocity, high-volume packet data; that is what big data and analytics technologies are for, so you will need such technologies to extract information and insights from large, fast network packet data. Some technologies, however, cannot keep up with the velocity and volume of the data. In particular, network speeds and densities are evolving faster than software running on a general-purpose CPU can keep up with. For this reason, be wary of a network monitoring and packet capture solution designed around shared resources, such as a single general-purpose CPU doing all the processing. Such an architecture lacks the processing power to ingest high-speed network packet data from multiple ports simultaneously and therefore will not meet these objectives.
Sampling is another way to handle the high volume of data; the technique arose when flows were introduced in 1996. However, sampling flows from a network running at 10 Mbps is far less demanding than sampling flows from a network operating at 100 Gbps.
Other than sampling, which will not work for every situation, you will need to reliably capture data with the characteristics listed above. Your monitoring and capturing must be lossless otherwise there will be gaps due to dropped packets that create problematic blind spots. And, frankly, a lossy solution is really no solution.
Only a purpose-built solution can reliably and consistently capture network packet data, especially for dense high-speed networks. At the architecture level, distributed processing is necessary.
Dedicated Hardware and Algorithms Efficiently Inspect Every Single Packet
Unlike conventional architectures (e.g., designs from other vendors that use a general-purpose CPU and shared memory), cPacket’s design uses dedicated hardware and streamlined software algorithms on a per-port basis, capable of inspecting every single packet at line rates of 100 Gbps and beyond. We call these “smart ports.” Inspecting every single packet has many valuable uses for troubleshooting network problems. First and foremost, cPacket’s hardware reduces the data into meaningful KPIs: it summarizes the traffic with counters instead of exporting millions of packets, enabling the NetOps team to observe what happens in a network operating at 100 Gbps with millisecond resolution and accuracy to within nanoseconds. This method of inspection reduces the amount of data that must be analyzed further.
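The counter-based summarization described above can be sketched in software: instead of exporting every packet, roll packets up into per-millisecond packet and byte counters (the KPIs). cPacket performs this in dedicated per-port hardware; this pure-Python version only illustrates the idea and invents its own sample data.

```python
from collections import Counter

def millisecond_counters(packets):
    """Bucket packets into per-millisecond packet and byte counters.

    packets: list of (timestamp_seconds, length_bytes) pairs.
    """
    pkt_count = Counter()
    byte_count = Counter()
    for ts, length in packets:
        bucket = int(ts * 1000)  # millisecond bucket index
        pkt_count[bucket] += 1
        byte_count[bucket] += length
    return pkt_count, byte_count

# Three illustrative packets spanning two millisecond buckets.
packets = [(0.0001, 1500), (0.0004, 64), (0.0012, 1500)]
pkts, bytes_ = millisecond_counters(packets)
```

However many packets arrive, the output per millisecond is a fixed, small set of counters, which is what makes the downstream analysis tractable.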
Network Data Packets Are the Ultimate Source of Truth
Most of the “interesting” traffic in today’s networks is limited to a small percentage of the packets, which is where advanced real-time packet processing adds significant value. However, finding the needle in the haystack cannot be done with blind sampling; it requires inspecting every single packet.
For example, when executing a huge backup of petabytes of data, detailed analysis is only required for the signaling packets, not for every packet. The cPacket solution inspects every single packet and, with advanced processing, identifies in real time the packets that require detailed analysis while still accurately counting every single packet. The combination of detailed, accurate KPIs with analysis of key packets (or key parts of packets) allows network operators to quickly understand what is going on, troubleshoot problems quickly, and identify root causes. This approach, only possible with the powerful distributed processing designed into cPacket’s solution, provides the benefits without the bloat.
Summary: Packets, Flows, or Events – Which Is Best for Troubleshooting?
Three sources and types of data were presented that facilitate troubleshooting networks, infrastructure, and enterprise applications. Each type of data has advantages, disadvantages, and challenges. Ideally using all three provides the most complete visibility. However, if you must choose only one source of data, choose network packet data because it is the richest.
The best type of data, network packet data, requires the most effort, consideration, and specialized equipment. So, when you are planning a solution to troubleshoot quickly and efficiently using network data, take your time and carefully architect one that will reliably integrate multiple sources and types of data into a complete solution. Also, as was covered in detail, performance matters a lot, so make sure your evaluation goes beyond reviewing a table of specifications to also evaluate the underlying architecture. If you depend on network packet data, you will need a solution with distributed processing power that completely, reliably, and consistently captures, processes, and forwards packets.
About the Author
Ron brings over 20 years of experience leading engineering teams through the creation and development of complex networking products. Ron started his career at Qualcomm, where he was a lead system engineer for mobile telephony systems and the creation of IP that is part of the core 3G and 4G systems. Ron was a co-founder of Mobilian, a wireless semiconductor company. He then joined Intel through the acquisition of Mobilian, where he led engineering teams in the wireless group and Intel’s new business group. He holds a BSc in Electrical and Computer Engineering from the Technion in Israel, and holds more than 15 granted US patents.