AI networking is entering a new phase. For years, the industry has focused on scale out: more pods, more switches and more bandwidth between racks. Scale out still matters, but it’s no longer the only driver of training performance.
Scale up is where the next constraint is forming, and Hyve is actively addressing this challenge.
Understanding the Two Networks in Every AI Cluster
Every AI cluster runs multiple distinct networks, and conflating them is where design problems start.
The scale up network handles xPU-to-xPU connectivity inside the server and within the rack. This local, proprietary fabric keeps xPU resources synchronized during training. Its main limitation is reach: low-latency, low-jitter scale up fabrics are typically confined to a single rack of xPU servers.
The scale out network interconnects xPUs rack-to-rack or pod-to-pod, enabling high-performance communication for AI workloads that use 10,000+ xPUs in a single training run. Built on a high-bandwidth, low-latency Clos topology, this fabric can scale to support a significantly larger cluster (10x) with consistent performance.
These two networks work together, but they are not interchangeable. They solve different problems. Scale up is how GPUs communicate and stay synchronized while training is running. Scale out is how the cluster expands. Treating them as separate design problems makes it easier to identify where limits appear and how they are addressed.
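To make the scale out side more concrete, here is a rough sizing sketch in Python for a folded Clos fabric. It assumes a non-blocking design built from identical switches, where every tier below the top splits its ports evenly between downlinks and uplinks, and each xPU takes one edge port. The function names and the example radix values are illustrative assumptions, not Hyve specifications.

```python
def clos_max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints for a non-blocking folded Clos fabric.

    Assumption: identical switches with `radix` ports. The top tier points
    all of its ports downward; every lower tier uses half its ports down
    and half up.
    """
    if radix < 4 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 4")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    return radix * (radix // 2) ** (tiers - 1)


def min_tiers(endpoints: int, radix: int) -> int:
    """Smallest number of tiers whose upper bound covers `endpoints`."""
    if endpoints < 1:
        raise ValueError("endpoints must be >= 1")
    tiers = 1
    while clos_max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


if __name__ == "__main__":
    # Hypothetical radix values, chosen only for illustration.
    for radix in (64, 128):
        print(radix, clos_max_endpoints(radix, 2), min_tiers(10_000, radix))
```

Under these assumptions, two tiers of radix-64 switches top out at 2,048 endpoints, so a 10,000+ xPU training run needs either a third tier or a higher-radix switch. Real fabrics often oversubscribe or reserve ports, so treat the result as an upper bound.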
Rail-Aligned Networks: Why GPU One Talks to GPU One
Once scale up and scale out are understood as separate layers, the next question is how GPUs inside the server connect to the switches beyond the server rack.
In a typical server, eight xPUs are physically presented as four interfaces. That means xPU one and xPU two share one port on the way out, and the same pattern repeats across the remaining xPUs.
From there, topology determines the wiring. One approach is a fat-tree design, where all four ports connect to a single switch. Another is rail-aligned or rail-optimized, where xPU one connects to switch one, xPU two connects to switch two, and that mapping stays consistent across every server in the rack, pod and cluster.
Rail alignment ensures that every xPU one communicates with the other xPU ones across the cluster, every xPU two with the other xPU twos, and so on. The goal is parallelization at the cluster level, with the highest redundancy and lowest latency, using a repeatable pattern that starts at the rack level.
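The wiring rules above can be sketched in a few lines of Python. This is a simplified model, not a cabling plan: it takes the eight-xPU, four-interface server from the text, assumes 1-based numbering and an even split of two xPUs per interface, and follows the description of rail alignment literally, with one rail switch per xPU position. All function names are hypothetical.

```python
XPUS_PER_SERVER = 8        # from the text: eight xPUs per server
INTERFACES_PER_SERVER = 4  # from the text: presented as four interfaces


def _check_xpu(xpu: int) -> None:
    if not 1 <= xpu <= XPUS_PER_SERVER:
        raise ValueError(f"xpu must be in 1..{XPUS_PER_SERVER}")


def interface_for_xpu(xpu: int) -> int:
    """1-based physical interface shared by this 1-based xPU.

    Assumption: consecutive xPUs share an interface (1 and 2, 3 and 4, ...).
    """
    _check_xpu(xpu)
    xpus_per_interface = XPUS_PER_SERVER // INTERFACES_PER_SERVER
    return (xpu - 1) // xpus_per_interface + 1


def rail_aligned_switch(server: int, xpu: int) -> int:
    """Rail-aligned: the switch depends only on the xPU position."""
    _check_xpu(xpu)
    return xpu  # xPU n on every server lands on rail switch n


def build_rails(num_servers: int) -> dict[int, list[tuple[int, int]]]:
    """Group (server, xpu) pairs by the rail switch they attach to."""
    rails: dict[int, list[tuple[int, int]]] = {}
    for server in range(1, num_servers + 1):
        for xpu in range(1, XPUS_PER_SERVER + 1):
            switch = rail_aligned_switch(server, xpu)
            rails.setdefault(switch, []).append((server, xpu))
    return rails


if __name__ == "__main__":
    # Rail 2 holds xPU two from every server in a three-server rack.
    print(build_rails(3)[2])
```

Because the switch depends only on the xPU position, adding a server adds one member to every rail without changing any existing mapping, which is the repeatable pattern described above.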
Why the Focus Is Shifting
Scale out is what makes large training environments possible. Scale up determines how those environments perform once they’re running.
That is why the design opportunity is shifting toward scale up. Large deployments are increasingly built around alternative accelerators, including tensor processors, other xPUs, and custom ASICs. In those environments, third-party scale up solutions become essential for xPU-to-xPU communication within the server and rack.
Hyve’s positioning follows that shift. The initial focus is on the scale out network with high-performance, high-radix switching; however, more attention is being directed toward the scale up network. What was once considered a proprietary, in-rack fabric is changing: standard Ethernet-based solutions now exist, opening the opportunity to make design decisions that leverage similar silicon and switching solutions at scale, without the constraints of a proprietary fabric.




