I enjoyed playing ping pong in my youth, like many of us. Working in the IT industry, I didn’t expect to be playing ping pong with various operational teams and support groups while isolating network issues. But here we are at the end of 2021, and exonerating the network infrastructure can still be very frustrating!
Network metric data and system telemetry provide increasingly valuable insights into network flow behavior and anomalies. The deluge of data, including noncritical events, logs, metrics, and alerts, can be overwhelming.
Shared responsibility in the cloud and layered network ownership make incident ownership, troubleshooting, and the SLA process increasingly challenging for today’s hybrid operational teams. When multiple support teams and organizations are involved, the focus becomes optimizing incident response and reducing MTTR by getting to the data quickly. Not only do we play operational ping pong with our server, platform, or DevOps teams, we now also have to factor in Cloud Service Provider support teams such as those at AWS and Azure. Send us your PCAP!
Before dropping into a use case for cloud operations, let’s first review a few definitions. Figure 1 below shows a typical client-server TCP connection flow. With an agentless monitoring appliance (cVu) in the conversation path, we can report on many Key Performance Indicators (KPIs) that help us understand the health of the connection flow and where latency is introduced.
Figure 1 – TCP Connection Flow
|Server Response Time||Network Latency (DIFF between packet SYN and SYN-ACK)|
|Server RTT||The average round trip time from the server (Network Latency + Server Processing)|
|Client RTT||The average round trip time from the client (Network Latency + Client Processing)|
|zWins||The number of TCP Zero Windows from the hosts across all active sessions|
|Connection Error||Initial connection SYN packet not acknowledged (no SYN-ACK)|
|Retransmissions||The number of retransmissions from the host|
|Active Sessions||The number of TCP sessions that sent packets during the measurement time slice|
|Network Monitor||VLAN segment, port groups, CIDR block, or network vantage point, etc.|
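To make a few of these definitions concrete, here is a minimal Python sketch that derives the SYN to SYN-ACK latency, a retransmission count, and a zero-window count from a list of already-decoded packets. The `Packet` record and the function names are hypothetical and exist only for this illustration; this is not how the cPacket appliances compute their KPIs. The sketch also simplifies by ignoring ports and sequence-number wraparound.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical, already-decoded TCP packet record (illustration only)."""
    ts: float         # capture timestamp in seconds
    src: str          # source IP address
    dst: str          # destination IP address
    flags: str        # TCP flags, e.g. "S", "SA", "A", "PA", "R"
    seq: int          # TCP sequence number
    payload_len: int  # TCP payload length in bytes
    window: int       # advertised receive window


def network_latency(packets: List[Packet], client: str, server: str) -> Optional[float]:
    """Seconds between the client's first SYN and the server's SYN-ACK."""
    syn_ts = None
    for p in packets:
        if p.src == client and p.dst == server and p.flags == "S" and syn_ts is None:
            syn_ts = p.ts
        elif p.src == server and p.dst == client and p.flags == "SA" and syn_ts is not None:
            return p.ts - syn_ts
    # No SYN, or a SYN that was never acknowledged (a Connection Error in the table).
    return None


def count_retransmissions(packets: List[Packet], host: str) -> int:
    """Count data segments from `host` that repeat an earlier (dst, seq, length)."""
    seen = set()
    count = 0
    for p in packets:
        if p.src != host or p.payload_len == 0:
            continue
        key = (p.dst, p.seq, p.payload_len)
        if key in seen:
            count += 1
        else:
            seen.add(key)
    return count


def count_zero_windows(packets: List[Packet], host: str) -> int:
    """Count packets from `host` advertising a zero receive window (RSTs excluded)."""
    return sum(1 for p in packets
               if p.src == host and p.window == 0 and "R" not in p.flags)
```

A production implementation would key on the full 4-tuple (addresses and ports) and track each session separately; the sketch only shows the shape of the calculations.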
At cPacket, we like to talk about the 4Ws of pinpointing the root cause of complex problems: the What, Where, When, and Why. cPacket adds Virtual Packet Broker appliances (Network Packet Broker cVu®-V) into the infrastructure to provide lossless network monitors (aka collectors or vantage points) that collect, replicate, filter, and forward packets. Network monitors strategically located in the network infrastructure forward traffic to security, forensics, NDR, performance, and packet capture tools. The cPacket cStor® Packet Capture appliance provides network packet storage and archiving for forensic investigation, and the cClear® Analytics Engine appliance provides the KPI visualizations through a single pane of glass.
Upon receiving a call from a Help Desk or a disgruntled customer, the priority is to identify the root cause and determine, as quickly as possible, whether it is a client/server, application, or pure network infrastructure problem.
Is it the VM instance, application, or network?
For the reported incident, do you have an IP address?
If the Help Desk reports a server or instance IP address for investigation, select the cClear > Capture option.
Enter the reported server IP address and the time period under investigation, and add any filtering to reduce the noise in the PCAP file. Then select Download.
Figure 2 – Select Download PCAP for 10.51.10.207
It really is that quick and easy to download the aggregated PCAP file across multiple collectors in the network. Use the “Range Settings” to select the incident time period under investigation and work on the archived forensic data.
Figure 3 – PCAP File for 10.51.10.207
If you do not have specific client/server details, go to Dashboards > TCP Health.
From the Dashboards options, Figure 4 shows the TCP Health view, with the network segments (DMZ, AWS, LAB) listed in rows and the KPIs in columns. It gives the operator a high-level view of which network segments are showing problems and which are operating normally, which makes it an excellent starting point. The TCP Health visualization below quickly shows that, over the last 5 minutes, the incident is isolated to the LAB segment, where it affects the Server Retransmissions and Zero Window KPIs, while network services remain healthy. That gives us the What, the Where, and a high-level When.
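The triage logic behind a segment-by-KPI view can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the KPI names, the counts, and the thresholds are made up, and the sketch does not reflect how cClear evaluates health.

```python
def flag_segments(health, thresholds):
    """Return {segment: [kpi, ...]} for every KPI whose value exceeds its threshold.

    `health` maps a segment name to a dict of KPI counts; `thresholds` maps a
    KPI name to the highest value still considered healthy. KPIs without a
    threshold are never flagged.
    """
    flagged = {}
    for segment, kpis in health.items():
        bad = [name for name, value in kpis.items()
               if value > thresholds.get(name, float("inf"))]
        if bad:
            flagged[segment] = sorted(bad)
    return flagged


# Illustrative values only, not taken from the dashboard in Figure 4.
health = {
    "DMZ": {"server_retransmissions": 0, "zero_windows": 0},
    "AWS": {"server_retransmissions": 2, "zero_windows": 0},
    "LAB": {"server_retransmissions": 42, "zero_windows": 7},
}
thresholds = {"server_retransmissions": 10, "zero_windows": 0}

print(flag_segments(health, thresholds))
```

With these made-up numbers, only the LAB segment is flagged, which mirrors the "What" and "Where" the dashboard is meant to answer at a glance.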
Figure 4 – TCP Health Dashboard
Clicking the LAB Server Retransmissions KPI (red box) takes you to a drill-down visualization showing the IP addresses in the flow for the last 5 minutes (Figure 5). This view shows both the client and server sides involved in the Server Retransmissions.
Figure 5 – TCP Errors Level 2 – Server Side Analysis
Now that we have the IP addresses we are interested in, downloading the capture is simple, as shown in Figure 6. There are options for filtering, including Berkeley Packet Filter (BPF) expressions, to home in on the data of interest.
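As an illustration of BPF filtering, the Python sketch below builds a filter expression in standard pcap-filter syntax for a host and, optionally, a peer and a TCP port. The helper `build_bpf` is hypothetical, and whether the capture dialog accepts the string in exactly this form is an assumption; the same expression can be used with tools such as `tcpdump` or as a Wireshark capture filter.

```python
import ipaddress
from typing import Optional


def build_bpf(host: str, port: Optional[int] = None, peer: Optional[str] = None) -> str:
    """Build a pcap-filter expression for traffic to or from `host`.

    Optionally narrow the filter to a single `peer` address and a TCP `port`.
    Raises ValueError on malformed addresses or out-of-range ports.
    """
    ipaddress.ip_address(host)  # validates the address, raises ValueError if bad
    parts = [f"host {host}"]
    if peer is not None:
        ipaddress.ip_address(peer)
        parts.append(f"host {peer}")
    if port is not None:
        if not 0 < port < 65536:
            raise ValueError(f"invalid TCP port: {port}")
        parts.append(f"tcp port {port}")
    return " and ".join(parts)


print(build_bpf("10.51.10.207", port=443))
# host 10.51.10.207 and tcp port 443
```

The resulting string can also be applied after the fact to a downloaded file, for example with `tcpdump -r capture.pcap 'host 10.51.10.207 and tcp port 443'`, where `capture.pcap` is a placeholder file name.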
Figure 6 – Select Download PCAP 10.51.10.207
Figure 7 – Wireshark PCAP Forensic file for 10.51.10.207
In this incident example, we discovered the network was operating as expected. The connection between the two offending hosts was generating out-of-order TCP segments. This is the time to engage the server and/or application team and let them know that the two nodes in the LAB network require detailed inspection. Send over the PCAP file!
The team discovered that the port 443 connection was coming from a development vSphere VM instance to an engineering server in a hung state. The system was no longer responding to user input, but its IP address was still responding.
At cPacket Networks, we understand network visibility and the operational complexity of today’s evolving networks. The cCloud® Visibility Suite consists of agentless appliances for brokering, capturing, forwarding, and analyzing network traffic, which is essential for isolating an issue between the Network, Server, or Application teams. This gives much greater confidence when working on a P1 incident during an enormously stressful time. Be the hero rather than playing ping pong.
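For readers who want to see what "out-of-order" means at the sequence-number level, here is a simplified Python sketch that classifies the data segments of one direction of one TCP flow. The function name and input shape are hypothetical, and the heuristic is deliberately crude: it ignores sequence-number wraparound and timing, so it will not match the results of a full analyzer such as Wireshark in every case.

```python
def classify_segments(segments):
    """Split suspicious segments into (out_of_order, retransmitted).

    `segments` is an iterable of (timestamp, seq, payload_len) tuples for one
    direction of one flow, in capture order. A segment that exactly repeats an
    earlier (seq, payload_len) is treated as a retransmission; a new segment
    that starts below the highest byte already seen is treated as out of order.
    """
    next_expected = None  # one past the highest sequence byte seen so far
    seen = set()
    out_of_order, retransmitted = [], []
    for ts, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data to reorder
        if (seq, length) in seen:
            retransmitted.append((ts, seq, length))
        elif next_expected is not None and seq < next_expected:
            out_of_order.append((ts, seq, length))
        seen.add((seq, length))
        end = seq + length
        if next_expected is None or end > next_expected:
            next_expected = end
    return out_of_order, retransmitted


# Illustrative input: the second and third segments arrive swapped,
# and the last one repeats an earlier segment.
flow = [(0.000, 1, 100), (0.001, 201, 100), (0.002, 101, 100), (0.300, 201, 100)]
ooo, rtx = classify_segments(flow)
print(ooo)  # [(0.002, 101, 100)]
print(rtx)  # [(0.3, 201, 100)]
```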