Building the Next Generation of AI Data Center Networks

The evolution from traditional data centers to AI-focused infrastructure represents one of the most significant shifts in network architecture we've seen in decades. As AI workloads scale from billions to hundreds of billions of parameters, the underlying network infrastructure must transform to keep pace.

The following are some of the key challenges in AI networking:

The Bandwidth Challenge

A single NVL72 rack from Nvidia delivers 72 terabits per second of data across 180 connections and more than 2,880 fiber connections. Bandwidth requirements are increasing by orders of magnitude, from 200 Gbps to 400 Gbps, 800 Gbps, and now 1.6 terabits per second. The critical challenge: scaling from a single rack to multiple racks while maintaining the performance AI workloads demand.
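
To make the scaling pressure concrete, here is a back-of-the-envelope Python sketch. The fibers-per-port and rack counts are illustrative assumptions, not vendor specifications:

```python
# Rough fiber-count and bandwidth estimate when scaling an AI fabric
# across racks. All inputs are illustrative assumptions, not vendor specs.

GPUS_PER_RACK = 72        # e.g., an NVL72-class rack
FIBERS_PER_PORT = 8       # assume parallel optics with 8 fibers per port
PORT_SPEEDS_GBPS = [200, 400, 800, 1600]   # the progression named above

def gpu_facing_fibers(num_racks: int, ports_per_gpu: int = 1) -> int:
    """Fibers for GPU-facing links only, ignoring spine uplinks."""
    return num_racks * GPUS_PER_RACK * ports_per_gpu * FIBERS_PER_PORT

for racks in (1, 4, 16):
    print(f"{racks:>3} rack(s): ~{gpu_facing_fibers(racks):,} GPU-facing fibers")

for speed in PORT_SPEEDS_GBPS:
    print(f"{speed:>5} Gbps/port -> ~{GPUS_PER_RACK * speed / 1000:.1f} Tbps per rack")
```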

Latency: Every Nanosecond Counts

Silicon providers like Broadcom have optimized their chips for AI workloads, reducing latency from 600-750 nanoseconds down to approximately 250 nanoseconds, a reduction of more than 50%. But jitter matters just as much as latency. For training jobs involving thousands of coordinated nodes, even small timing variations can have cascading effects.
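
A quick way to see why jitter hurts at scale: a synchronized training step completes only when the slowest node finishes. The sketch below uses assumed, not measured, latency figures:

```python
# A synchronized training step finishes only when the slowest of N nodes
# finishes, so the tail of the latency distribution dominates at scale.
# Base latency and jitter values below are illustrative, not measured.
import random

random.seed(0)

def step_time_ns(n_nodes: int, base_ns: float = 250.0, jitter_ns: float = 25.0) -> float:
    """One synchronized step: completion time = max over all node latencies."""
    return max(random.gauss(base_ns, jitter_ns) for _ in range(n_nodes))

for n in (2, 64, 4096):
    mean = sum(step_time_ns(n) for _ in range(200)) / 200
    print(f"{n:>5} nodes: mean step latency ~{mean:.0f} ns (base 250 ns)")
```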

Rethinking Network Topology

Traditional Clos (leaf-spine) topologies are common for north-south traffic but can be limited in radix, requiring additional hierarchical layers to provide more I/O ports within the cluster. Dragonfly and Flattened Butterfly networks offer alternative topologies in which server groups are interconnected in highly meshed patterns. This “flattens” the topology, maintains low node-to-node latency, and allows easier expansion, though it introduces more cabling complexity.
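
The radix limit is easy to quantify. A minimal sketch, assuming an idealized non-blocking folded Clos, shows how endpoint capacity depends on switch radix and tier count, and why outgrowing a two-tier fabric forces another layer of hops:

```python
# Why radix limits a Clos fabric: with switch radix r, an idealized
# non-blocking two-tier leaf-spine tops out near r^2 / 2 endpoints;
# growing past that forces a third tier and extra hops.

def clos_max_endpoints(radix: int, tiers: int) -> int:
    """Idealized endpoint capacity of a folded Clos with `tiers` levels.
    Every tier splits ports evenly between down-links and up-links,
    except the top tier, which points all ports downward."""
    return (radix // 2) ** (tiers - 1) * radix

for radix in (64, 128):
    for tiers in (2, 3):
        print(f"radix {radix}, {tiers} tiers -> "
              f"~{clos_max_endpoints(radix, tiers):,} endpoints")
```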

Intelligent Telemetry and Security

Modern AI networks require advanced telemetry spanning the entire path from user workload to GPU to switch and back. A single link failure can trigger cascading alerts across thousands of connections, making AI-powered telemetry essential for filtering noise and identifying root causes.
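
As a rough illustration of the noise-filtering problem, the sketch below groups per-flow alerts by the path elements they share; the element common to the most failing flows is the likeliest root cause. The alert format and device names are invented for illustration:

```python
# Hypothetical alert de-duplication: when one link fails, every flow
# crossing it raises its own alert; grouping alerts by shared path
# element surfaces one probable root cause instead of thousands.
from collections import Counter

# Invented alert records for illustration.
alerts = [
    {"flow": "job42/gpu007", "path": ["leaf3", "spine1", "leaf9"]},
    {"flow": "job42/gpu013", "path": ["leaf5", "spine1", "leaf9"]},
    {"flow": "job42/gpu021", "path": ["leaf5", "spine1", "leaf8"]},
]

# The path element shared by the most failing flows is the likeliest cause.
counts = Counter(elem for alert in alerts for elem in alert["path"])
root, hits = counts.most_common(1)[0]
print(f"probable root cause: {root} (appears in {hits}/{len(alerts)} alerts)")
```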

Security must shift from "trust but verify" to zero-trust: verify everything, trust nothing. While this adds overhead, it's essential for protecting valuable AI workloads and sensitive training data.

Cooling: From Air to Liquid

Next-generation switches face thermal challenges: a 1.6 terabit transceiver can consume up to 24 watts. Air cooling is becoming impractical for 2RU switch form factors. Direct liquid cooling (DLC) for network switches, aligned with server rack cooling infrastructure, is expected to take center stage this year.
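
A simple power budget makes the problem clear. In the sketch below, the port count and ASIC power are illustrative assumptions rather than a specific product's figures:

```python
# Back-of-the-envelope thermal budget for a high-radix 1.6T switch.
# Port count and ASIC power are illustrative assumptions.
PORTS = 64
WATTS_PER_TRANSCEIVER = 24   # upper figure cited above for 1.6T optics
ASIC_WATTS = 500             # assumed switch-ASIC power

optics_watts = PORTS * WATTS_PER_TRANSCEIVER
total_watts = optics_watts + ASIC_WATTS
print(f"optics: {optics_watts} W + ASIC: {ASIC_WATTS} W "
      f"= ~{total_watts} W in a 2RU box")
```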

Edge Computing Infrastructure

While hyperscale data centers focus on training, inference is moving to the edge. Autonomous vehicles, medical diagnostics, and industrial automation require low-latency inference. These applications may not need massive bandwidth, but reliability and latency remain critical.

Key Considerations for AI Data Center Networking

  • Infrastructure Alignment: Use common infrastructure across your data center. Mixing air-cooled and DLC environments creates unnecessary complexity and cost.
  • Interoperability: Modern data centers run heterogeneous environments with InfiniBand, Ethernet, and various protocols. Orchestration across all of them is essential to unify the full stack.
  • Scalability: Technologies like expanded beam optics (EBO), high-density trunking, and shuffle panels help manage increasing fiber complexity.
  • Future Technologies: Linear pluggable optics (LPO) and co-packaged optics (CPO) will reduce power consumption by more than 50% and help address system-level thermal challenges (a rough power comparison follows this list).
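
As a rough illustration of that power-reduction claim, the sketch below compares an assumed retimed (DSP-based) transceiver against an assumed LPO figure; both per-port wattages are illustrative:

```python
# Rough comparison behind the >50% optics power-reduction claim.
# Both per-port wattages are illustrative assumptions.
PORTS = 64
RETIMED_W = 24.0   # assumed DSP-based (retimed) transceiver
LPO_W = 10.0       # assumed linear-drive (LPO) figure

retimed_total = PORTS * RETIMED_W
lpo_total = PORTS * LPO_W
saving_pct = 100 * (1 - lpo_total / retimed_total)
print(f"retimed: {retimed_total:.0f} W, LPO: {lpo_total:.0f} W "
      f"-> ~{saving_pct:.0f}% less optics power")
```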

The Path Forward

AI data center networks require focus in three areas: ultra-high bandwidth with low latency and reliability, operational efficiency through aligned infrastructure, and continuous innovation in ASIC and switch design.

Software-defined networking is evolving toward AI-powered orchestration spanning network devices, GPUs, servers, and workloads. We're also seeing shifts in physical media, including adoption of EBO, the transition from traditional transceivers to LPO and CPO, and SerDes signaling moving from 100 Gbps PAM4 to 200 Gbps PAM4.
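
The SerDes step is worth a quick calculation: doubling the lane rate halves the number of electrical lanes per port, which in turn reduces fiber counts for parallel optics. A small sketch:

```python
# What moving from 100G to 200G PAM4 SerDes buys: half the electrical
# lanes per port, which also halves fiber counts for parallel optics.
LANE_RATES_GBPS = {"100G PAM4": 100, "200G PAM4": 200}

for port_gbps in (800, 1600):
    lanes = {name: port_gbps // rate for name, rate in LANE_RATES_GBPS.items()}
    summary = ", ".join(f"{name}: {count} lanes" for name, count in lanes.items())
    print(f"{port_gbps}G port -> {summary}")
```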

The next generation of network architecture isn't just about faster speeds or more bandwidth. It's about building holistic systems that can scale efficiently, operate reliably, and adapt to demanding AI workloads. Collaboration through initiatives like the Open Systems for AI (OSAI) consortium will be crucial in defining standards and best practices for this new AI era.

Here is the VoD of Michael Lane's talk at OCP '25.
