What’s Needed for High-Fidelity, Low-Latency HPC Network Monitoring at 100Gbps

Enterprise migration to 100Gbps network speeds continues to accelerate, with high-performance computing (HPC) data centers leading the charge. According to data from Crehan Research, 100 and 25 Gbps Ethernet speeds rose 40 percent year-over-year in 2020, with more growth expected over the next several years.

HPC Use Cases for 100Gbps

HPC use cases such as financial trading, oil and gas, pharmaceuticals, 3D rendering and modeling, meteorology, and other types of research benefit significantly from the increased speed of 100Gbps, but they are also highly sensitive to issues like jitter and latency. To maintain a high-quality experience, avoid costly downtime, and meet security and compliance requirements, HPC operators must ensure their network monitoring capabilities keep pace with higher network speeds as they refresh their clusters to 100Gbps.

Technical Challenges for Lossless Monitoring

This is no small task. Lossless monitoring, along with the data acquisition, processing, and metrics measurement that go with it, is technically challenging when a data packet can pass through the monitoring fabric every 6.7 nanoseconds. Packets that are missed can hide problems, impede IT efficiency, and create business risks. Key performance indicators (KPIs) that are observed at low resolution, or that are averaged, obscure details. They won’t help operators address sophisticated cyberattacks, maintain the increased network utilization that comes with cloud and digital transformation, or assure high-quality experiences across distributed hybrid environments.
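
The 6.7-nanosecond figure can be reproduced with a little arithmetic. The sketch below is one plausible reading, not a derivation given in the article: it assumes minimum-size 64-byte Ethernet frames arriving back to back, plus the standard 8 bytes of preamble and start-of-frame delimiter and the 12-byte interframe gap. The function and constant names are illustrative only.

```python
# Assumption: back-to-back minimum-size Ethernet frames on a 100Gbps link.
LINE_RATE_BPS = 100_000_000_000   # 100Gbps
MIN_FRAME_BYTES = 64              # minimum Ethernet frame, including FCS
PREAMBLE_SFD_BYTES = 8            # 7-byte preamble + 1-byte start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12         # minimum idle time between frames, in byte times


def packet_interval_ns(frame_bytes, line_rate_bps=LINE_RATE_BPS):
    """Time between the starts of consecutive frames at full line rate."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return wire_bits * 1e9 / line_rate_bps


def max_packets_per_second(frame_bytes, line_rate_bps=LINE_RATE_BPS):
    """Upper bound on the packet rate for a given frame size."""
    return 1e9 / packet_interval_ns(frame_bytes, line_rate_bps)


if __name__ == "__main__":
    for size in (MIN_FRAME_BYTES, 1518):
        print(f"{size}-byte frames: {packet_interval_ns(size):.2f} ns apart, "
              f"{max_packets_per_second(size) / 1e6:.1f} million packets/s")
```

Under these assumptions a 64-byte frame occupies 84 bytes (672 bits) on the wire, which takes 6.72 ns at 100Gbps, or roughly 148.8 million packets per second. Larger frames arrive less often, so this is the worst case a lossless monitoring fabric has to be sized for.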

Technical Requirements for High-Fidelity Network Monitoring

There are several technical requirements for high-fidelity network monitoring that are especially important for HPC use cases. Here are six of the most pressing ones.

  1. Latency resolution – HPC operators must carefully evaluate the requirements of their particular workloads and ensure prospective monitoring solutions measure latency in sufficient detail. For example, if an application cannot tolerate more than 20 milliseconds of latency, then the monitoring solution must measure latency with 10-millisecond resolution or better.
  2. Performance data close to the source – Every hop adds bias to network performance metrics like latency and jitter, so it’s best to monitor this information as close to the source as possible. HPC use cases where timing is crucial, like high-frequency trading, will require a way to add high-resolution timestamping to packet data and monitor performance metrics where those packets are captured, not several hops down the line. This may require deploying network packet brokers that offer those features on the box, rather than in a centralized monitoring solution.
  3. Ingest limitations on performance monitoring and security tools – If a company’s performance and security monitoring tools cannot ingest data at 100Gbps, they will drop packets and thus risk missing indicators of serious security and performance issues. To avoid this, HPC organizations must either replace their tools or deploy packet brokers that allow for control of packet data slicing, filtering, throttling, aggregation, and distribution. These packet brokers can consolidate packet streams and send them to the various tools at the most efficient data rate for each. This can prevent a network upgrade from forcing a tools upgrade.
  4. Packet capture hardware assistance – Network speeds over 10Gbps start to strain the capabilities of packet capture devices that aren’t built to handle them. A single general-purpose CPU architecture cannot capture and write packets to disk at speeds greater than 10Gbps. Hardware assistance is necessary to observe and process packets without dropping them and creating blind spots.
  5. Traffic bursts and microbursts – Packet capture solutions have two speeds: a sustained capture speed that they can run at indefinitely, and a “burst” speed that they can sustain only for a short period of time (usually less than one minute). This allows monitoring solutions to capture “bursts” of network traffic that go far above usual averages. It’s unusual, even for HPC networks, to run at full capacity all the time, so a solution with a 100Gbps burst capture is usually sufficient for most HPC use cases. Still, organizations should carefully assess their particular traffic scenarios. A second, related issue is “microbursts”: very small but sharp spikes in network traffic that can be overlooked by monitoring solutions that lack sufficient resolution. HPC use cases that tend toward this type of “spiky” traffic, like financial services, will require a way to detect and analyze those microbursts.
  6. Virtualized network and public cloud – Most if not all HPC sites will have some workloads in the cloud or in a virtualized environment. Can your chosen network monitoring solution easily incorporate packet data from those sources? Accessing packet data in the public cloud is difficult. It requires either a virtual packet broker with built-in traffic mirroring features that AWS and Google Cloud support, or a virtual packet broker in an “inline mode” that captures traffic en route to the cloud, such as with Microsoft Azure. Even if cloud or virtualized network monitoring isn’t a priority for an enterprise now, it likely will be in the future, so consider this functionality when selecting a network performance monitoring solution.
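
To make the point about resolution and microbursts concrete, here is a minimal sketch of how the same traffic looks at two measurement resolutions. Everything in it is an assumption for illustration: the trace is synthetic, the 80 percent threshold is arbitrary, packets are represented as (timestamp in nanoseconds, size in bytes) pairs, and framing overhead is ignored. It is not how any particular packet broker or monitoring product works.

```python
from collections import defaultdict

LINK_BPS = 100_000_000_000  # 100Gbps link (assumed)


def utilization_by_window(packets, window_ns, link_bps):
    """Return {window_index: fraction of link capacity used} for each window."""
    bits_per_window = defaultdict(int)
    for timestamp_ns, size_bytes in packets:
        bits_per_window[timestamp_ns // window_ns] += size_bytes * 8
    capacity_bits = link_bps * window_ns / 1e9  # bits the link can carry per window
    return {w: bits / capacity_bits for w, bits in sorted(bits_per_window.items())}


def find_microbursts(packets, window_ns, link_bps, threshold=0.8):
    """Return (window_start_ns, utilization) for windows at or above threshold."""
    utilization = utilization_by_window(packets, window_ns, link_bps)
    return [(w * window_ns, u) for w, u in utilization.items() if u >= threshold]


def synthetic_trace():
    """One second of light background traffic plus a burst lasting under 1 ms."""
    # One 1500-byte packet every 100 microseconds for a full second.
    background = [(t, 1500) for t in range(0, 1_000_000_000, 100_000)]
    # 8000 packets spaced 120 ns apart, starting at the 500 ms mark.
    burst = [(500_000_000 + i * 120, 1500) for i in range(8000)]
    return background + burst


if __name__ == "__main__":
    trace = synthetic_trace()
    one_second = utilization_by_window(trace, 1_000_000_000, LINK_BPS)
    print(f"1 s average utilization: {one_second[0]:.3%}")
    for start_ns, util in find_microbursts(trace, 1_000_000, LINK_BPS):
        print(f"microburst at {start_ns / 1e6:.0f} ms: {util:.1%} of line rate")
```

In this synthetic trace the one-second average is well under one percent of link capacity, yet a single one-millisecond window runs at about 96 percent of line rate. A monitoring tool that reports only per-second averages would never show the burst, which is why the resolution of the measurement matters as much as the capture rate.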

Detailed visibility helps HPC clusters and enterprises troubleshoot issues more quickly and proactively, maximize security, and ensure a positive user experience. To maintain that visibility in a 100Gbps low-latency environment, HPC operators and the IT teams supporting those workloads should make sure their chosen monitoring solution addresses all six of these requirements.