Scale Up vs. Scale Out in AI Networks

AI networking is entering a new phase. For years, the industry has focused on scale out: more pods, more switches, and more bandwidth between racks. Scale out still matters, but it’s no longer the only driver of training performance.

Scale up is where the next constraint is forming, and it is where Hyve is building.

Understanding the Two Networks in Every AI Cluster

Every AI cluster runs on two distinct networks, and conflating them is where design problems start.

The scale up network handles GPU-to-GPU connectivity inside the server and across the rack. It is the local fabric that keeps GPUs synchronized during training. Limits like NVL72 and NVL144 show up here, because scale up caps how far GPU-to-GPU communication can extend before the workload becomes dependent on scale out infrastructure.

The scale out network connects pod to pod. It is what links pods together so 10,000-plus GPUs can operate across a single training run. It expands as pods are added; it is the network that grows with the cluster.

These two networks work together, but they are not interchangeable. They solve different problems. Scale out is how the cluster expands. Scale up is how GPUs communicate and stay synchronized while training is running. Treating them as separate design problems makes it easier to identify where limits appear first, and what has to change to move past them.
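To make the split concrete, here is a minimal sketch of the arithmetic, using assumed domain and pod sizes rather than figures from any specific deployment. It shows how the scale up domain caps local GPU-to-GPU reach while the scale out network stitches pods into the full cluster:

```python
# Hypothetical sketch with assumed sizes, not figures from any specific deployment.
# The scale up domain caps GPU-to-GPU reach locally; the scale out network
# links pods together to reach the full cluster size.

SCALE_UP_DOMAIN = 72      # GPUs per scale up domain (e.g. one rack-scale domain)
GPUS_PER_POD = 1024       # assumed pod size
TARGET_CLUSTER = 10_000   # GPUs participating in a single training run

domains_per_pod = -(-GPUS_PER_POD // SCALE_UP_DOMAIN)   # ceiling division
pods_needed = -(-TARGET_CLUSTER // GPUS_PER_POD)

print(f"scale up domains per pod: {domains_per_pod}")
print(f"pods linked by the scale out network: {pods_needed}")
```

Past the scale up limit, every additional hop a collective has to take runs over the scale out fabric, which is why the boundary between the two networks matters for training performance.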

Rail-Aligned Networks: Why GPU One Talks to GPU One

Once scale up and scale out are understood as separate layers, the next question is how GPUs inside the server connect to the switches above the rack.

In a typical server, eight GPUs are physically presented as four interfaces. That means GPU one and GPU two share one port on the way out, and the same pattern repeats across the remaining GPUs.

From there, topology determines the wiring. One approach is a fat-tree design, where all four ports connect to a single switch. Another is rail-aligned, where GPU one connects to switch one, GPU two connects to switch two, and that mapping stays consistent across every server in the cluster.

Rail alignment ensures GPU ones communicate with GPU ones across the cluster, GPU twos with GPU twos, and so on. The goal is parallelization at the cluster level, built from a repeatable pattern at the rack level.
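As an illustration of that wiring, the sketch below uses hypothetical switch names and assumes the eight-GPU, four-port layout described above; the two topologies differ only in which leaf switch each port lands on:

```python
# Hypothetical sketch of the two wirings described above, not a vendor tool.
# Assumes 8 GPUs presented as 4 egress ports per server (GPU pairs share a port).

PORTS_PER_SERVER = 4

def fat_tree_switch(server: int, port: int) -> str:
    # Fat-tree style: all four ports of a server land on the same leaf switch.
    return f"leaf-{server}"

def rail_aligned_switch(server: int, port: int) -> str:
    # Rail-aligned: port N of every server lands on switch N (rail N),
    # so same-numbered GPUs talk to each other on the same rail cluster-wide.
    return f"rail-{port}"

for server in range(3):
    for port in range(1, PORTS_PER_SERVER + 1):
        print(f"server {server}, port {port}: "
              f"fat-tree -> {fat_tree_switch(server, port)}, "
              f"rail-aligned -> {rail_aligned_switch(server, port)}")
```

The design choice is the consistency of the mapping: because port N always lands on rail N, the same communication pattern repeats rack after rack as the cluster grows.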

Why the Focus Is Shifting

Scale out is what makes large training environments possible. Scale up determines how those environments perform once they’re running.

That is why the design opportunity is shifting toward scale up. Large deployments are increasingly built around alternative accelerators, including tensor processors, other GPUs, and custom ASICs. In those environments, third-party scale up solutions become essential for GPU-to-GPU communication within the server and rack.

Hyve’s positioning follows that shift. The focus is on the scale up network opportunity, and the design decisions that start at the rack level and scale into the cluster. The point is to lead where the next constraint is forming, not to offer another take on traditional scale out switching.
