When you are designing applications that run across the scale of an entire datacenter and that are comprised of hundreds to thousands of microservices running on countless individual servers and that have to be called within a matter of microseconds to give the illusion of a monolithic application, building fully connected, high bisection bandwidth Clos networks is a must.
This is especially true as application servers, middleware servers, database servers, and storage servers can be anywhere within a datacenter. You never know what in the network is going to need to talk to what else. So you overprovision on bandwidth and connectivity and keep the tail latencies as small as possible.
But high bandwidth Clos networks are not necessarily the best architecture for an AI training system. Particularly given how expensive networking has become for AI clusters. With the cost and complexity of AI networks on the rise, something has to give. And that is why researchers at the Computer Science and Artificial Intelligence Laboratory at MIT have been working with their networking peers at Meta Platforms to think outside of the box. Or perhaps more precisely, think about what is already in the box – in an effort to eliminate an expensive layer of switching from AI networks and therefore cut costs drastically without reducing the performance of AI training.
The resulting rail-only network architecture that CSAIL and Meta Platforms have come up with was described in a recent paper and presented at the Hot Interconnects 2024 conference this week, and it definitely wins the “well that’s pretty obvious once you think about it” award from The Next Platform. We love such “obvious” insights because they often turn technologies on their head, and we think what the CSAIL and Meta Platforms researchers have figured out has the potential to transform network architecture for AI systems in particular.
Before we get into this rail-only architecture insight – which we might have called an inverted spine network based on what they actually did – let’s set the stage a little bit.
Clos networks are a way to connect any node or element within a node (such as a GPU or DPU) to all of the other nodes or elements within the entire datacenter. These Clos networks are not the only way to do all-to-all links between devices on a network. Many supercomputing centers use dragonfly topologies these days, but when you add machines, you have to rewire the entire network, unlike Clos topologies, which allow this fairly easily but which do not provide consistent latency across the network as a dragonfly network does. (We discussed these topology issues back in April 2022 when we analyzed Google’s proprietary “Aquila” network interconnect, which is based on a dragonfly topology.)
As you are well aware, the big AI training systems take on the order of 24,000 to 32,000 GPUs to train a big model with trillions of parameters in a relatively timely fashion. The number of GPUs used in a system at Meta Platforms today to train its Llama 3.1 405B model is 24,576, as we have previously reported, and CSAIL and Meta Platforms expect that the next-gen models are looking at spanning 32,768 GPUs in a single cluster. The Clos networks are based on Ethernet leaf and spine switches, all with remote direct memory access (RDMA) support so GPUs can share data with all of the other GPUs in the network at the same time using that all-to-all topology.
Weiyan Wang, a doctoral student at CSAIL, did the presentation on the rail-only architecture at Hot Interconnects, and said that building a Clos network with high bandwidth to interconnect over 32,000 GPUs would cost $153 million, and the network all by itself would consume 4.7 megawatts of electricity. The paper is a little more precise on the network speed for another comparison, saying that a full bisection bandwidth Clos fabric linking 30,000 GPUs using 400 Gb/sec links would cost $200 million. Suffice it to say, this is a lot of money. Much more money than any hyperscaler and cloud builder is used to spending to connect 4,096 server nodes together.
Here is a very interesting chart that Wang put together showing the interplay of network cost and network power as the AI clusters are scaled:
At a doubling of GPU count to 65,536 devices, the network would cost $300 million at 400 Gb/sec port speeds and would consume around 6 megawatts of power.
Most of the GPU clusters that run large language models use what is called a rail-optimized network, and it is a variant of the leaf/spine network that is familiar to readers of The Next Platform. This is what the comparisons are for in the data above. It looks like this:
You have to organize the compute elements and the way that work is dispatched to them in some fashion, and the interesting bit about these rail-optimized networks is that they aggregate the ranks of the calculating devices across rails. So the first compute engine in each node is connected on one leaf switch, the second compute engine in each node is on another leaf, and so on.
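The rank-to-rail mapping described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: GPUs are numbered node by node (a hypothetical numbering scheme), and the cluster size of 128 eight-way nodes is picked purely as an example.

```python
from collections import defaultdict

GPUS_PER_NODE = 8   # eight-way nodes (illustrative)
NUM_NODES = 128     # illustrative cluster size

def node_of(global_rank, gpus_per_node=GPUS_PER_NODE):
    """Node that hosts a GPU, assuming node-major numbering (an assumption)."""
    return global_rank // gpus_per_node

def rail_of(global_rank, gpus_per_node=GPUS_PER_NODE):
    """Rail index equals the GPU's local rank inside its node."""
    return global_rank % gpus_per_node

def build_rails(num_nodes=NUM_NODES, gpus_per_node=GPUS_PER_NODE):
    """Group every GPU in the cluster by the rail it attaches to."""
    rails = defaultdict(list)
    for rank in range(num_nodes * gpus_per_node):
        rails[rail_of(rank, gpus_per_node)].append(rank)
    return dict(rails)

rails = build_rails()
print(len(rails), len(rails[0]))  # 8 rails, 128 GPUs on each
print(rails[3][:4])               # first few GPUs on rail 3: [3, 11, 19, 27]
```

Each rail ends up holding exactly one GPU from every node, which is the property the rest of the design leans on.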
To give a more precise – and as you will see, more relevant – example, Wang showed how a cluster with 128 of Nvidia’s eight-way DGX H100 nodes would have their GPUs interlinked with a total of 16 leaf switches, with two leaf switches per rail to do the eight different GPU ranks across the cluster:
Here is the insight that the CSAIL and Meta Platforms researchers made. They wondered what the traffic patterns were across the rails and up into the spine switches as an LLM was training, and they made an amazing and very useful discovery: Most of the traffic stays within the rails, and does not cut across the rails:
The tests that CSAIL and Meta Platforms ran were not for the Llama 3 models from the social network, but rather on variants of the OpenAI GPT family of models with different parameter counts.
And here is a further drilldown on the traffic patterns of the Megatron GPT-1T model:
Whether there is pipeline parallelism, tensor parallelism, or data parallelism, only very rarely does the traffic go up into those expensive spine switches that interlink the leaf-based rail switches. Aha!
So what you can do is just chop the heads off the network. Get rid of the spine aggregation switches entirely.
But wait a minute, you say. What about those rare times when you need to share data across rails? Well, it just so happens that each HGX node inside of a DGX server (or one of its clones) has a bunch of very high-bandwidth, very low latency NVSwitch memory fabric switches inside. And rather than bounce data up from the leaves to the spines to cross rails, you can bounce it over to an adjacent rail using the NVSwitch fabric.
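To make the forwarding idea concrete, here is a rough sketch of how a path between two GPUs might be chosen when there is no spine. The `route` function, the hop labels, and the choice to hop over NVSwitch on the source node first are all our own assumptions for illustration; the researchers may well order the hops differently.

```python
def route(src, dst, gpus_per_node=8):
    """Return the hops from GPU src to GPU dst in a rail-only network.

    Each hop is (fabric, next_gpu). GPUs are numbered node by node,
    which is an assumption made for this sketch.
    """
    src_node, src_rail = divmod(src, gpus_per_node)
    dst_node, dst_rail = divmod(dst, gpus_per_node)

    if src == dst:
        return []
    if src_node == dst_node:
        # Same node: the NVSwitch fabric connects the two GPUs directly.
        return [("nvswitch", dst)]
    if src_rail == dst_rail:
        # Same rail: one trip through the rail switch, no spine needed.
        return [("rail", dst)]
    # Different node and different rail: first move sideways inside the
    # source node to the GPU sitting on the destination rail, then ride
    # that rail to the destination node.
    relay = src_node * gpus_per_node + dst_rail
    return [("nvswitch", relay), ("rail", dst)]

print(route(0, 5))    # same node
print(route(0, 8))    # same rail
print(route(0, 13))   # cross-rail, relayed through GPU 5 on node 0
```

The cross-rail case costs one extra hop inside the box instead of a round trip through a spine switch.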
Genius!
A testament to seeing what is right there in front of all of our eyes, the rail-only network:
And this is why we call it an inverted spine switch. It is not that the spine switch is not needed, but that NVSwitch has enough capacity to do the job for the little bit of bandwidth and time when it is needed. (There is no Infinity Fabric switch from AMD, so this may not work with AMD’s GPUs.)
You have to be careful about this, of course. You have to place the shards and replicas that drive tensor parallelism and data parallelism on the same rail in the network for this to work.
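One way to sanity check a placement is sketched below. It assumes a Megatron-style rank ordering where tensor parallelism varies fastest, then data parallelism, then pipeline parallelism, and it assumes GPUs are numbered node by node. Both are our assumptions, and `groups_stay_on_rail` is a hypothetical helper, not something from the paper.

```python
from itertools import product

def groups_stay_on_rail(tp, dp, pp, gpus_per_node=8):
    """True if every data-parallel and pipeline-parallel group sits on one rail.

    Assumes rank = t + tp * (d + dp * p), i.e. tensor parallelism is the
    fastest-varying dimension (an assumption made for this sketch).
    """
    def rank(t, d, p):
        return t + tp * (d + dp * p)

    def one_rail(ranks):
        # A group is rail-local if all members share the same local rank.
        return len({r % gpus_per_node for r in ranks}) == 1

    dp_groups = ([rank(t, d, p) for d in range(dp)]
                 for t, p in product(range(tp), range(pp)))
    pp_groups = ([rank(t, d, p) for p in range(pp)]
                 for t, d in product(range(tp), range(dp)))
    return all(one_rail(g) for g in dp_groups) and all(one_rail(g) for g in pp_groups)

# Tensor-parallel groups that fill a whole eight-GPU node keep the
# cross-node groups on a single rail each.
print(groups_stay_on_rail(tp=8, dp=16, pp=8))
# A tensor-parallel size of 4 spreads a data-parallel group over two rails.
print(groups_stay_on_rail(tp=4, dp=32, pp=8))
```

With this ordering, the first layout passes and the second does not, which is the kind of mistake that would push traffic onto the NVSwitch relay path far more often than intended.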
Here is the upshot of this simple switch (so to speak) from rail-optimized to rail-only networks when it comes to the reduction in the cost of switches and transceivers, the latter of which utterly dominate the cost of the overall network:
Nvidia may come to regret having such a powerful switch at the heart of the HGX system board. . . . But probably not. Even Nvidia knows that the network can’t represent 20 percent or 25 percent or more of the system cost for AI to proliferate.
To bring it on back to the cluster of 128 DGX H100 servers used in the example above, with the rail-optimized network, you need 20 128-port switches across the spines and leaves to interconnect those 1,024 GPUs in the back-end network. You also need 2,688 transceivers to link the GPUs to the leaves and the leaves to the spines. With the rail-only network, you drop to eight switches for the rails and a mere 1,152 transceivers to put the GPUs on eight separate rails. You use the hundreds of NVSwitch ASICs on the HGX boards as the rarely used inverted spine aggregation layer. This will save 41 kilowatts of power and eliminate $1.3 million in network costs.
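For those who like to see the percentages, the figures quoted above work out as follows. The script only does arithmetic on the numbers in the text; the per-GPU savings are our own derived figures, not ones the researchers reported.

```python
GPUS = 1024  # 128 nodes x 8 GPUs

rail_optimized = {"switches": 20, "transceivers": 2688}
rail_only = {"switches": 8, "transceivers": 1152}

def reduction_pct(before, after):
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before

switch_cut = reduction_pct(rail_optimized["switches"], rail_only["switches"])
xcvr_cut = reduction_pct(rail_optimized["transceivers"], rail_only["transceivers"])

# Savings quoted in the text, spread evenly over the GPUs (derived here).
dollars_per_gpu = 1_300_000 / GPUS
watts_per_gpu = 41_000 / GPUS

print(f"switches cut by {switch_cut:.1f}%")      # 60.0%
print(f"transceivers cut by {xcvr_cut:.1f}%")    # 57.1%
print(f"about ${dollars_per_gpu:,.0f} and {watts_per_gpu:.0f} W saved per GPU")
```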
In benchmark tests, there was no performance impact on 3D parallelism for LLM training of this rail-only approach and there was only an 11.2 percent performance overhead on all-to-all communications in the cluster. In the LLM models trained, all-to-all communication was only 26.5 percent of total communication, so we are talking about a hit of roughly 3 percent on communication performance (11.2 percent of 26.5 percent), and remember communication time is only a fraction of the overall wall time during an LLM training run. So the effect of using NVSwitches as the occasional spine is negligible.
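Multiplying the two figures quoted above lands just under 3 percent of communication time. The share of wall time spent communicating is not given in the text, so the 30 percent used below is a purely hypothetical placeholder to show how the overhead shrinks further at the level of the whole training run.

```python
all_to_all_overhead = 0.112   # slowdown on all-to-all traffic (from the text)
all_to_all_share = 0.265      # all-to-all as a share of total communication (from the text)

# Overhead expressed against all communication time.
comm_overhead_pct = all_to_all_overhead * all_to_all_share * 100

# HYPOTHETICAL: fraction of wall time spent communicating. Not from the
# article; replace with a measured value for a real workload.
comm_share_of_wall_time = 0.30
wall_time_overhead_pct = comm_overhead_pct * comm_share_of_wall_time

print(f"communication overhead: {comm_overhead_pct:.2f}%")
print(f"wall time overhead at a 30% communication share: {wall_time_overhead_pct:.2f}%")
```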
This may or may not be the case on other kinds of data analytics, AI training, or HPC simulation workloads. But it will be interesting for people to try to figure this out.
Cell-based fabric in Arista makes sure the network links in a rail-optimised design are used evenly and avoids leaving links idle.
Ah yes, but this eliminates the spine and its costs and all of those transceivers! And makes use of networking already in the box.
Seems a very complex way to describe a well-localized geometry of processing, so you can use a low bandwidth global link.
This only duplicates the processor-cache1-cache2-cache3-etc locality of processing pattern known for a thousand years.
This is an interesting concept and is similar to the host network concept described in my book (found here https://www.amazon.co.uk/introduction-compartmentalisation-management-migration-micro-segmentation/dp/B09L3PNXZL)
Where this article talks about the use of shortest routes to remove switches from the network fabric, my concept is discussed in the context of improving speed and security within business domains and private clouds. I would love to explore whether these two concepts could support each other in a wider range of workloads.
Cool idea to grant this the TPM “obvious once you think about it” (or OOY-TAI) award! As long as the HBDs have fast all-to-all interconnects, severing (or inverting) the spine does look like a winning proposition. The “outer” network then goes from a rather hierarchical leaf-spine, spine-rib, or hub-spoke Clos configuration, with ingress/afferent, central/middle, and egress/efferent parts, into more of a synaptic exoskeleton, a Fujitsu-style n-D TOFU torus, a near-mesh, or a dragonfly hairnet it seems, but with crisscrossing tracks (or bias ply).
I’d invest the money saved by going spineless on getting the fastest possible switches ever for that rail network though, not putting it in a piggy bank.
Grammar police here: it’s “comprised from”, “composed of”. Comprised basically means “squished together”, so “it was squished together from parts”. I thought this was the first thing they teach you when you get a job in writing. 😉