UALink Fires First GPU Interconnect Salvo At Nvidia NVSwitch

Compute engine makers can do all they want to bring the performance of their devices on par with, or even reasonably close to, that of Nvidia’s various GPU accelerators, but until they have something akin to the NVLink and NVSwitch memory fabric that Nvidia uses to leverage the performance of many GPUs, at bandwidths that dwarf those of PCI-Express switches and latencies that undercut those of Ethernet interconnects, they can never catch up.

Which is why the publication today of the Ultra Accelerator Link 1.0 specification is a landmark event in the AI revolution.

The UALink effort was started back in May 2024, and is a follow-on to, and a broadening of, the effort that AMD and Broadcom announced back in December 2023. Initially, AMD and Broadcom were working together to put the xGMI protocol for memory load/store semantics on AMD’s GPUs and CPUs on top of Broadcom’s “Atlas” PCI-Express switches. But after careful consideration, AMD and Broadcom wanted to do more than this and take on NVLink and NVSwitch head on, creating a more streamlined and different set of ASICs and protocols rather than just kluging something together that would sort of work. (Like the CXL memory protocol over PCI-Express, which has some serious limitations in terms of bandwidth and radix.)

And so AMD and Broadcom asked Cisco Systems, Google, Hewlett Packard Enterprise, Intel, Meta Platforms, and Microsoft to help them create UALink as an alternative to what Nvidia has for accelerator memory fabrics. AMD contributed its xGMI and broader Infinity Fabric protocols, which run on physical transports that are a mix of AMD’s own HyperTransport NUMA interconnect and the well-established PCI-Express. The group then got to work hacking away at PCI-Express to trim out all of the cruft, creating a new Data Link Layer and Transaction Layer that are unique to UALink and marrying them to a modified set of Ethernet SerDes wrapped around the stack. The combination looks like it will be better than CXL over PCI-Express, which really has not taken off on accelerators as a way to create NUMA domains across those accelerators, and it will be on par with what Nvidia has to offer with its NVSwitch memory fabric today and into the future.

The UALink 1.0 specification was expected to be delivered a bit earlier in 2025, and clearly it took a little extra time to get out. We can forgive the UALink Consortium vendors for their exuberance and optimism, particularly if they do deliver something that will give us what is akin to NVLink and NVSwitch for the masses.

If this technology pans out as planned, there is every reason to expect that Nvidia will be compelled by market forces to move to the UALink standard, but don’t expect that to happen any time soon. AMD, Intel, and others will have to sell a lot more compute engines using UALink before that happens, and demonstrate its benefits in terms of price or performance.

If UALink turns out to be a drop-in replacement for NVSwitch at the system and rack level, with minimal design changes needed for AI models and system designs, all the better for competition in the AI systems business. This is particularly important since the size of the NUMA domain across GPUs and other kinds of math accelerators matters for AI training and now, with the advent of chain of thought “reasoning” models, for AI inference as well. When all we were doing was getting one-shot, blurty answers out of chatbots, GPU memory domain size was not an issue, and regular PCI-Express interconnects were fine for inference.

That’s no longer the state of the art. We used to joke that in the long run, AI training would eventually look like AI inference. But the opposite is turning out to be true. AI inference is getting way more complex, and the chain of thought approach is feeding back into AI training. Just look at how DeepSeek is blurring the lines between training and inference.

Doing This One On Spec

The UALink members obfuscated a bit when the group was formed last summer about precisely how and what they would do. There was some talk about PCI-Express and Ethernet not being the right stuff, for reasons we have covered time and again here at The Next Platform, and plenty of mutterings by ourselves and others about what might be done. What has been done is simple and elegant, and should be fairly easy for the networking ecosystem to adopt and productize. Companies that make PCI-Express switches – Astera Labs, Broadcom, Marvell, and Microchip – will want to make UALink switches, which we would call UASwitches to distinguish them from the UALink ports on the compute engines.

Working from the outside in, the UALink stack starts with a slightly modified Ethernet SerDes that runs at a 212.5 GT/sec signal rate, which is whittled down to 200 Gb/sec of bandwidth per UALink lane once the encoding overhead is taken into account:

This Ethernet physical layer has standard forward error correction (FEC) and adheres to the IEEE P802.3dj specification. Latency is improved by one-way and two-way code word interleaving, and there is a slight change to support a 680 byte flit. (The flit, or flow control unit, is the atomic unit of data at the link level.) Here’s the neat bit, and where PCI-Express has changed with the 6.0 specification and laid the groundwork for UALink.

With PCI-Express 6.0, the PCI-SIG that controls that standard (and which was largely steered by Intel), rather than just layering on a heavyweight FEC, which would have radically increased the latency of PCI-Express data transmissions, moved to a mix of flow control units (flits), a lightweight FEC, and cyclic redundancy check (CRC) error detection with replay, which actually improved the reliability of signal transmission while dropping the latency. Some of these smarts are being added to UALink, and a whole lot of things that are not necessary for memory fabrics have been left out.
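To make that tradeoff concrete, here is a toy expected-latency model, a minimal sketch with invented placeholder numbers rather than figures from either the PCI-Express or UALink specifications. The idea: a heavyweight FEC pays a large fixed decode cost on every flit, while a lightweight FEC plus CRC replay pays a small fixed cost and only eats the replay penalty on the rare flit that actually has to be resent.

```python
# Illustrative only: a toy model of why pairing a lightweight FEC with
# CRC-and-replay can beat a heavyweight FEC on latency. All numbers are
# made-up placeholders, not figures from either specification.

def expected_latency_ns(base_ns, fec_ns, flit_error_rate, replay_penalty_ns):
    # Fixed cost of decoding the FEC on every flit, plus the replay penalty
    # weighted by how often a flit actually has to be retransmitted.
    return base_ns + fec_ns + flit_error_rate * replay_penalty_ns

# Heavyweight FEC: big fixed decode cost, replays almost never happen.
heavy = expected_latency_ns(base_ns=30, fec_ns=100,
                            flit_error_rate=1e-9, replay_penalty_ns=500)

# Lightweight FEC plus CRC retry: small fixed cost, occasional replays.
light = expected_latency_ns(base_ns=30, fec_ns=10,
                            flit_error_rate=1e-5, replay_penalty_ns=500)

print(f"heavyweight FEC:           ~{heavy:.1f} ns per flit")
print(f"light FEC plus CRC replay: ~{light:.1f} ns per flit")
```

The point is not the absolute numbers, which are invented, but the shape of the math: as long as the residual error rate after the light FEC is low enough, the occasional replay costs far less than decoding a heavy code on every single flit.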

“We start with 200 Gb/sec SerDes,” Peter Onufryk, an Intel Fellow who is working on UALink at the behest of his employer, tells The Next Platform. “It has four lanes per port, for 800 Gb/sec, and you can aggregate multiple ports. You can also go up to 1,024 accelerators in a fabric, so it scales pretty high in the space that we are at. UALink is a simple protocol, so it’s not PCI Express, but it’s optimized for scale up fabrics, with simple memory reads and writes and atomics, large operations. It removes the ordering constraints of PCI-Express, so the only ordering is within a 256 byte boundary. But if you are across, you can reorder. The way to think about UALink is that it has the latency of a PCI-Express switch, the power of a PCI-Express switch, the area of a PCI-Express switch, but with Ethernet SerDes.”
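That relaxed ordering rule is worth unpacking. As we read Onufryk’s description, two requests only have to stay in order if they land in the same 256 byte aligned window; otherwise the fabric is free to reorder them. Here is a minimal sketch of that rule; the function name and framing are ours, not something lifted from the specification:

```python
# A minimal sketch of the relaxed ordering rule as we read it: requests
# within the same 256-byte-aligned window stay ordered, everything else
# may be reordered by the fabric. This is our interpretation, not spec text.

BOUNDARY = 256

def must_stay_ordered(addr_a: int, addr_b: int) -> bool:
    """True if two memory requests fall in the same 256-byte-aligned window."""
    return (addr_a // BOUNDARY) == (addr_b // BOUNDARY)

assert must_stay_ordered(0x1000, 0x10FF)      # same 256 B window, keep order
assert not must_stay_ordered(0x1000, 0x1100)  # different windows, may reorder
```

Loosening ordering like this is part of how a simple fabric keeps its pipelines full without the heavier tracking that strict PCI-Express ordering implies. And then there is that closing line about doing all of this with Ethernet SerDes.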

Which makes us wonder if this would not make a perfect Ethernet switch. . . . But let’s do that another day and just talk about the UALink link stack for now. Take a look:

Incidentally, that 1,024 compute engine coherence limit is on a single level of UALink switching infrastructure. If you want to add more levels – which adds more latency – you can build a larger NUMA domain for the compute engines.
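The spec does not pin down switch radix – as Onufryk says further down, that is up to the people building the silicon – so here is the usual folded-Clos arithmetic with an assumed radix, purely as an illustration of how each extra level of switching buys reach at the cost of switch hops, and therefore latency:

```python
# Purely illustrative Clos arithmetic. The UALink 1.0 spec leaves switch
# radix to implementers, so the 256-port radix below is an assumption for
# the sake of the example, not a figure from the spec.

def endpoints(radix: int, tiers: int) -> int:
    """Folded-Clos scaling: one tier of switches reaches 'radix' endpoints;
    each additional non-blocking tier multiplies reach by radix / 2."""
    return radix * (radix // 2) ** (tiers - 1)

def worst_case_switch_hops(tiers: int) -> int:
    """Endpoint-to-endpoint path length through a folded Clos."""
    return 2 * tiers - 1

RADIX = 256  # assumed for illustration
for tiers in (1, 2):
    print(f"{tiers} tier(s): up to {endpoints(RADIX, tiers):,} endpoints, "
          f"{worst_case_switch_hops(tiers)} switch hop(s) worst case")
```

With a single tier you are limited by the radix of the switch; add a tier and the reach explodes, but every request now crosses more switch hops and eats more latency, which is exactly the tradeoff the consortium is describing.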

The UALink 1.0 spec supports 100 Gb/sec and 200 Gb/sec speeds per lane, with the former being used to make 100 Gb/sec, 200 Gb/sec, and 400 Gb/sec ports and the latter being used to make 200 Gb/sec, 400 Gb/sec, and 800 Gb/sec ports. We don’t know how many ports a future UASwitch might have, so we don’t know how it would compare to any given NVSwitch. But clearly, if Nvidia can gang up ports to get more bandwidth out of devices, so can the UALink adopters.

We found this sentence in the UALink 1.0 presentation interesting: “Designed for deterministic performance achieving 93 percent effective peak bandwidth.” We don’t have a “compared to what” there, but we will start looking for one.
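In the meantime, if we take that 93 percent at face value against the peak port rate, the arithmetic chains together neatly from the numbers above: a 212.5 GT/sec signaling rate nets out to 200 Gb/sec per lane after encoding, lanes gang into x1, x2, and x4 ports, and the effective number is 93 percent of whatever the port peaks at. A quick back-of-the-envelope check, with our assumption about the baseline flagged:

```python
# Back-of-the-envelope check of the bandwidth figures in the article. The
# assumption that the 93 percent is measured against peak port bandwidth is
# ours; the consortium has not said what the baseline is.

SIGNAL_RATE_GT = 212.5   # raw signaling rate per lane (GT/sec)
LANE_RATE_GB = 200.0     # usable rate per lane after encoding (Gb/sec)

encoding_overhead = 1 - LANE_RATE_GB / SIGNAL_RATE_GT
print(f"encoding overhead per lane: {encoding_overhead:.1%}")   # about 5.9%

for lanes in (1, 2, 4):                    # x1, x2, and x4 ports
    port_gbps = lanes * LANE_RATE_GB
    effective_gbps = 0.93 * port_gbps      # assumed baseline: peak port rate
    print(f"x{lanes} port: {port_gbps:.0f} Gb/sec peak, "
          f"~{effective_gbps:.0f} Gb/sec effective per direction")
```

On that assumption, an x4 port at 800 Gb/sec peak would deliver about 744 Gb/sec of effective bandwidth per direction.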

The presentations that UALink members were giving earlier this year said that UALink uses one half to one third of the die area and power of an equivalent Ethernet ASIC, port for port, and saves somewhere between 150 watts and 200 watts per accelerator in a memory fabric. The smaller chip size translates into a cheaper chip, and the lower power translates into less electricity and cooling consumed, both of which make the overall TCO better.
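Those per-accelerator savings add up quickly at scale. Here is a simple extrapolation of the consortium’s 150 watt to 200 watt figure across a few pod sizes; the pod sizes are sample points we picked, not announced products:

```python
# Illustrative extrapolation of the 150 W to 200 W per-accelerator savings
# figure from the UALink member presentations. The pod sizes below are
# arbitrary sample points, not announced products.

SAVINGS_LOW_W, SAVINGS_HIGH_W = 150, 200

for accelerators in (72, 256, 1024):
    low_kw = accelerators * SAVINGS_LOW_W / 1000.0
    high_kw = accelerators * SAVINGS_HIGH_W / 1000.0
    print(f"{accelerators:>5} accelerators: {low_kw:.1f} kW to {high_kw:.1f} kW saved")
```

At the full 1,024 accelerator scale-up limit, that is roughly 155 kilowatts to 205 kilowatts of power and cooling that does not have to be provisioned, before any die-area savings on the switch silicon are counted.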

Those presentations also said that UALink port to port hop latencies would be less than 100 nanoseconds. Depending on the radix and brand of the PCI-Express switches, Onufryk says the port hop on a PCI-Express switch runs from as low as 70 nanoseconds to as high as 250 nanoseconds. In the early merchant silicon era in the late 2000s, we saw 10 Gb/sec Ethernet switches with 350 nanosecond to 450 nanosecond latencies, and it is not unusual for a stock Ethernet switch to do 1 microsecond or even 2 microseconds. That is high compared to the 100 nanosecond to 120 nanosecond latency of an InfiniBand switch. The UALink consortium is not enforcing latency limits, so vendors will do what they will.

Kurtis Bowman, director of architecture and strategy at AMD, who is co-leader of the UALink effort and chairman of the UALink Consortium, said that a latency between 100 nanoseconds and 150 nanoseconds for UALink switches “feels right.”

“It’s like anything,” says Bowman. “Once the first switches are out, they will figure ways to improve them. We will probably see some good midrange numbers, and then, as time goes on, they will slide that to the left.”

As for the radix of these switches – meaning how many lanes and ports they drive, and at what aggregate bandwidth – that is also up to the UALink switch makers.

“We have specified the physical layer, and we have specified how packets are routed by ID, and people could go build whatever they want,” says Onufryk. “It’s like PCI-Express – some people build small switches, some people go big switches, and they all try to figure out what the right place is.”

Conceptually, here is what a UALink rackscale pod might look like:

Just because the UALink 1.0 protocol can support a NUMA memory domain for accelerators that has 1,024 devices interlinked does not mean that people will jump right in and start building something that scales up that far. (But, boy, would it be fun if someone did.)

Look at how conservative Nvidia has been. NVSwitch 3 fabrics using NVLink 4 ports could, in theory, span up to 256 GPUs in a shared memory pool, but only eight GPUs were supported in commercial products from Nvidia. With NVSwitch 4 and NVLink 5 ports, Nvidia can, in theory, support a memory pool spanning up to 576 GPUs, but in practice commercial support is only being offered on machines with up to 72 GPUs in the GB200 NVL72 and GB300 NVL72 systems. And the largest domain that Nvidia has on its roadmap (at least for now) has only 576 GPU chiplets in a single memory image, with four GPU chiplets per socket and 144 sockets per rack.

It looks like at some point UALink might have a scale up advantage, but a lot depends on how well the all-to-all networking that underpins AI processing works on machines with hundreds of compute engines sharing their high bandwidth memory.

It is important to realize that UALink is not a knock off of NVLink. They are different, even though it looks like NVLink is the lovechild of PCI-Express and InfiniBand. (And NVLink and NVSwitch predate the acquisition of Mellanox Technologies by Nvidia.)

“There are differences between UALink and NVLink,” says Bowman. “NVLink is a x2, so they gang two lanes together, always. UALink allows for x1, x2, or x4 for a port, and then you can gang ports after that,  just like with Nvidia you can gang NVLink ports. So there are some differences, and while they are subtle, they do help depending on the kind of a system you are trying to build and the bandwidth you need. We think 800 gigabits per direction, which is 1.6 terabits bidirectional, gives you what we think is sufficient bandwidth for the timeframe these UALink devices will be coming out.”
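Bowman’s arithmetic is easy enough to check, and it also shows where the flexibility comes in. A sketch, with the multi-port configurations beyond the single x4 port being our own hypothetical examples rather than anything announced:

```python
# Checking the bandwidth arithmetic in Bowman's comments: at 200 Gb/sec per
# lane, a x4 port is 800 Gb/sec per direction, or 1.6 Tb/sec bidirectional,
# and ports can be ganged on top of that. The multi-port configurations
# below are hypothetical examples, not announced products.

LANE_GBPS = 200

def fabric_bandwidth_tbps(lanes_per_port: int, ports: int) -> tuple[float, float]:
    per_direction = lanes_per_port * LANE_GBPS * ports / 1000.0  # Tb/sec
    return per_direction, 2 * per_direction                      # bidirectional

for lanes, ports in ((4, 1), (4, 2), (4, 4)):
    one_way, both_ways = fabric_bandwidth_tbps(lanes, ports)
    print(f"x{lanes} port x {ports} port(s): {one_way:.1f} Tb/sec per direction, "
          f"{both_ways:.1f} Tb/sec bidirectional")
```

A single x4 port lands exactly on the 800 Gb/sec per direction and 1.6 Tb/sec bidirectional that Bowman cites, and ganging ports scales that linearly from there.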

Normally, when a networking spec comes out, it takes about two years for the first devices using that technology to get into the field. But Bowman says that this time around, it will only take twelve to eighteen months because the demand is so high and everyone who is making UALink switches knows what they are doing.


