Building the Next Generation of AI Data Center Networks

The evolution from traditional data centers to AI-focused infrastructure represents one of the most significant shifts in network architecture we've seen in decades. As AI workloads scale from billions to hundreds of billions of parameters, the underlying network infrastructure must transform to keep pace.

The following are some of the key challenges in AI networking:

The Bandwidth Challenge

A single NVL72 rack from Nvidia delivers 72 terabits per second of data across 180 connections and more than 2,880 fiber connections. Bandwidth requirements are increasing by orders of magnitude, from 200 Gbps to 400 Gbps, 800 Gbps, and now 1.6 terabits per second. The critical challenge: scaling from a single rack to multiple racks while maintaining the performance AI workloads demand.
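
To make the scaling pressure concrete, here is a back-of-the-envelope Python sketch. The fibers-per-port and rack counts are illustrative assumptions, not vendor specifications:

```python
# Rough fiber-count and bandwidth estimate when scaling an AI fabric
# across racks. All inputs are illustrative assumptions, not vendor specs.

GPUS_PER_RACK = 72        # e.g., an NVL72-class rack
FIBERS_PER_PORT = 8       # assume parallel optics with 8 fibers per port
PORT_SPEEDS_GBPS = [200, 400, 800, 1600]   # the progression named above

def gpu_facing_fibers(num_racks: int, ports_per_gpu: int = 1) -> int:
    """Fibers for GPU-facing links only, ignoring spine uplinks."""
    return num_racks * GPUS_PER_RACK * ports_per_gpu * FIBERS_PER_PORT

for racks in (1, 4, 16):
    print(f"{racks:>3} rack(s): ~{gpu_facing_fibers(racks):,} GPU-facing fibers")

for speed in PORT_SPEEDS_GBPS:
    print(f"{speed:>5} Gbps/port -> ~{GPUS_PER_RACK * speed / 1000:.1f} Tbps per rack")
```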

Latency: Every Nanosecond Counts

Silicon providers like Broadcom have optimized their chips for AI workloads, reducing latency from 600-750 nanoseconds down to approximately 250 nanoseconds, a reduction of more than 50%. But jitter matters just as much as latency. For training jobs involving thousands of coordinated nodes, even small timing variations can have cascading effects.
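
A quick way to see why jitter hurts at scale: a synchronized training step completes only when the slowest node finishes. The sketch below uses assumed, not measured, latency figures:

```python
# A synchronized training step finishes only when the slowest of N nodes
# finishes, so the tail of the latency distribution dominates at scale.
# Base latency and jitter values below are illustrative, not measured.
import random

random.seed(0)

def step_time_ns(n_nodes: int, base_ns: float = 250.0, jitter_ns: float = 25.0) -> float:
    """One synchronized step: completion time = max over all node latencies."""
    return max(random.gauss(base_ns, jitter_ns) for _ in range(n_nodes))

for n in (2, 64, 4096):
    mean = sum(step_time_ns(n) for _ in range(200)) / 200
    print(f"{n:>5} nodes: mean step latency ~{mean:.0f} ns (base 250 ns)")
```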

Rethinking Network Topology

Traditional Clos (leaf-spine) topologies are common for north-south traffic but can be limited in radix, requiring additional hierarchical layers to provide more I/O ports within the cluster. Dragonfly and Flattened Butterfly networks offer alternative topologies in which server groups are interconnected in highly meshed patterns. This “flattens” the topology, maintains low node-to-node latency, and allows easier expansion, though it introduces more cabling complexity.
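
The radix limit is easy to quantify. A minimal sketch, assuming an idealized non-blocking folded Clos, shows how endpoint capacity depends on switch radix and tier count, and why outgrowing a two-tier fabric forces another layer of hops:

```python
# Why radix limits a Clos fabric: with switch radix r, an idealized
# non-blocking two-tier leaf-spine tops out near r^2 / 2 endpoints;
# growing past that forces a third tier and extra hops.

def clos_max_endpoints(radix: int, tiers: int) -> int:
    """Idealized endpoint capacity of a folded Clos with `tiers` levels.
    Every tier splits ports evenly between down-links and up-links,
    except the top tier, which points all ports downward."""
    return (radix // 2) ** (tiers - 1) * radix

for radix in (64, 128):
    for tiers in (2, 3):
        print(f"radix {radix}, {tiers} tiers -> "
              f"~{clos_max_endpoints(radix, tiers):,} endpoints")
```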

Intelligent Telemetry and Security

Modern AI networks require advanced telemetry spanning the entire path from user workload to GPU to switch and back. A single link failure can trigger cascading alerts across thousands of connections, making AI-powered telemetry essential for filtering noise and identifying root causes.
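
As a rough illustration of the noise-filtering problem, the sketch below groups per-flow alerts by the path elements they share; the element common to the most failing flows is the likeliest root cause. The alert format and device names are invented for illustration:

```python
# Hypothetical alert de-duplication: when one link fails, every flow
# crossing it raises its own alert; grouping alerts by shared path
# element surfaces one probable root cause instead of thousands.
from collections import Counter

# Invented alert records for illustration.
alerts = [
    {"flow": "job42/gpu007", "path": ["leaf3", "spine1", "leaf9"]},
    {"flow": "job42/gpu013", "path": ["leaf5", "spine1", "leaf9"]},
    {"flow": "job42/gpu021", "path": ["leaf5", "spine1", "leaf8"]},
]

# The path element shared by the most failing flows is the likeliest cause.
counts = Counter(elem for alert in alerts for elem in alert["path"])
root, hits = counts.most_common(1)[0]
print(f"probable root cause: {root} (appears in {hits}/{len(alerts)} alerts)")
```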

Security must shift from "trust but verify" to zero-trust: verify everything, trust nothing. While this adds overhead, it's essential for protecting valuable AI workloads and sensitive training data.

Cooling: From Air to Liquid

Next-generation switches face thermal challenges: a 1.6 terabit transceiver can consume up to 24 watts. Air cooling is becoming impractical for 2RU switch form factors. Direct liquid cooling (DLC) for network switches, aligned with server rack cooling infrastructure, is expected to take center stage this year.
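
A simple power budget makes the problem clear. In the sketch below, the port count and ASIC power are illustrative assumptions rather than a specific product's figures:

```python
# Back-of-the-envelope thermal budget for a high-radix 1.6T switch.
# Port count and ASIC power are illustrative assumptions.
PORTS = 64
WATTS_PER_TRANSCEIVER = 24   # upper figure cited above for 1.6T optics
ASIC_WATTS = 500             # assumed switch-ASIC power

optics_watts = PORTS * WATTS_PER_TRANSCEIVER
total_watts = optics_watts + ASIC_WATTS
print(f"optics: {optics_watts} W + ASIC: {ASIC_WATTS} W "
      f"= ~{total_watts} W in a 2RU box")
```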

Edge Computing Infrastructure

While hyperscale data centers focus on training, inference is moving to the edge. Autonomous vehicles, medical diagnostics, and industrial automation require low-latency inference. These applications may not need massive bandwidth, but reliability and latency remain critical.

Key Considerations for AI Data Center Networking

  • Infrastructure Alignment: Use common infrastructure across your data center. Mixing air-cooled and DLC environments creates unnecessary complexity and cost.
  • Interoperability: Modern data centers run heterogeneous environments with InfiniBand, Ethernet, and various protocols. Orchestration across all of them is essential to unify the full stack.
  • Scalability: Technologies like expanded beam optics (EBO), high-density trunking, and shuffle panels help manage increasing fiber complexity.
  • Future Technologies: Linear pluggable optics (LPO) and co-packaged optics (CPO) will reduce power consumption by more than 50% and help address system-level thermal challenges (a rough power comparison follows this list).
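
As a rough illustration of that power-reduction claim, the sketch below compares an assumed retimed (DSP-based) transceiver against an assumed LPO figure; both per-port wattages are illustrative:

```python
# Rough comparison behind the >50% optics power-reduction claim.
# Both per-port wattages are illustrative assumptions.
PORTS = 64
RETIMED_W = 24.0   # assumed DSP-based (retimed) transceiver
LPO_W = 10.0       # assumed linear-drive (LPO) figure

retimed_total = PORTS * RETIMED_W
lpo_total = PORTS * LPO_W
saving_pct = 100 * (1 - lpo_total / retimed_total)
print(f"retimed: {retimed_total:.0f} W, LPO: {lpo_total:.0f} W "
      f"-> ~{saving_pct:.0f}% less optics power")
```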

The Path Forward

AI data center networks require focus in three areas: ultra-high bandwidth with low latency and reliability, operational efficiency through aligned infrastructure, and continuous innovation in ASIC and switch design.

Software-defined networking is evolving toward AI-powered orchestration spanning network devices, GPUs, servers, and workloads. We're also seeing shifts in physical media, including adoption of EBO, the transition from traditional transceivers to LPO and CPO, and SerDes signaling moving from 100 Gbps PAM4 to 200 Gbps PAM4.
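
The SerDes step is worth a quick calculation: doubling the lane rate halves the number of electrical lanes per port, which in turn reduces fiber counts for parallel optics. A small sketch:

```python
# What moving from 100G to 200G PAM4 SerDes buys: half the electrical
# lanes per port, which also halves fiber counts for parallel optics.
LANE_RATES_GBPS = {"100G PAM4": 100, "200G PAM4": 200}

for port_gbps in (800, 1600):
    lanes = {name: port_gbps // rate for name, rate in LANE_RATES_GBPS.items()}
    summary = ", ".join(f"{name}: {count} lanes" for name, count in lanes.items())
    print(f"{port_gbps}G port -> {summary}")
```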

The next generation of network architecture isn't just about faster speeds or more bandwidth. It's about building holistic systems that can scale efficiently, operate reliably, and adapt to demanding AI workloads. Collaboration through initiatives like the Open Systems for AI (OSAI) consortium will be crucial in defining standards and best practices for this new AI era.

Here is the VoD of Michael Lane's talk at OCP '25.
