Key Hyperscalers And Chip Makers Gang Up On Nvidia’s NVSwitch Interconnect

The generative AI revolution is making strange bedfellows, as revolutions, and the emerging monopolies that capitalize on them, often do.

The Ultra Ethernet Consortium was formed in July 2023 to take on Nvidia’s InfiniBand high performance interconnect, which has quickly and quite profitably become the de facto standard for linking GPU accelerated nodes to each other. And now the Ultra Accelerator Link consortium is forming from many of the same companies to take on Nvidia’s NVLink protocol and NVLink Switch (sometimes called NVSwitch) memory fabric for linking GPUs into shared memory clusters inside of a server node and across multiple nodes in a pod.

Without question, the $6.9 billion acquisition of Mellanox Technologies, which was announced in March 2019 and which closed in April 2020, was a watershed event for Nvidia, and it has paid for itself about three times over since Mellanox was brought onto the Nvidia books.

The networking business at Nvidia was largely driven by Quantum InfiniBand switching sales, with occasional high volume sales of Spectrum Ethernet switching products to a few hyperscalers and cloud builders. That Ethernet business, plus its experience with InfiniBand, has given Nvidia the means to build a better Ethernet, the first iteration of which is called Spectrum-X, to counter the efforts of the Ultra Ethernet Consortium. The consortium seeks to build a low-latency, lossless variant of Ethernet that has all of the goodies of InfiniBand's congestion control and dynamic routing (implemented in unique ways) combined with the much broader and flatter scale of Ethernet, and it has a stated goal of eventually supporting more than 1 million compute engine endpoints in a single cluster with few levels of networking and respectable performance.
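
To put that 1 million endpoint goal in some perspective, here is a back-of-the-envelope sketch of how switch radix bounds the endpoint count of a flat, few-tier network. The fat tree formulas are the standard non-blocking ones; the radixes and port speeds below are our own illustrative assumptions, not figures from any UEC specification.

```python
# Back-of-the-envelope sketch, not anything from the UEC spec: how switch
# radix (ports per switch ASIC) bounds the number of endpoints reachable
# with only a few tiers of network. The radixes are illustrative assumptions.

def fat_tree_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints in a non-blocking fat tree: R^2/2 for two tiers,
    R^3/4 for three tiers."""
    if tiers == 2:
        return radix ** 2 // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only two and three tier trees are sketched here")

for radix in (64, 128, 256):
    print(radix, fat_tree_endpoints(radix, 2), fat_tree_endpoints(radix, 3))

# A 51.2 Tb/sec ASIC carved into 256 ports at 200 Gb/sec would reach
# 256^3 / 4 = 4,194,304 endpoints in three tiers, comfortably past the
# 1 million endpoint goal with "few levels of networking."
```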

NVLink started out as a way to gang up the memories on Nvidia GPU cards, and eventually Nvidia Research implemented a switch to drive those ports, allowing Nvidia to link more GPUs together than the two of a barbell topology or the four of a crisscrossed square topology, the layouts commonly used for decades to create two-socket and four-socket servers based on CPUs. Several years ago, AI systems needed eight or sixteen GPUs sharing their memory to make the programming easier and the datasets accessible to those GPUs at memory speeds, not network speeds. And so the NVSwitch that was in the labs was quickly commercialized in 2018 in the DGX-2 platform based on "Volta" V100 GPU accelerators.
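
For readers who have not wired up NUMA boxes before, here is a small Python sketch of the difference between those switchless hookups and a switched fabric. It is purely illustrative, not Nvidia code; the function names are ours and the adjacency sets simply restate the topologies described above.

```python
# Illustrative sketch only: the point-to-point NVLink topologies described
# above, expressed as adjacency sets, versus a switched fabric where every
# GPU reaches every other GPU through the switch.

def barbell() -> dict[int, set[int]]:
    """Two GPUs wired directly to each other over their NVLink ports."""
    return {0: {1}, 1: {0}}

def crossed_square() -> dict[int, set[int]]:
    """Four GPUs, each wired point-to-point to the other three, the same
    crisscrossed layout long used for four-socket CPU servers."""
    return {gpu: {peer for peer in range(4) if peer != gpu} for gpu in range(4)}

def switched(num_gpus: int) -> dict[int, set[int]]:
    """With a switch in the middle, the effective topology is all-to-all,
    which is what lets eight or sixteen GPUs share memory at full speed."""
    return {gpu: {peer for peer in range(num_gpus) if peer != gpu}
            for gpu in range(num_gpus)}

print(barbell())              # {0: {1}, 1: {0}}
print(len(crossed_square()))  # 4 GPUs, each with 3 direct peers
print(len(switched(16)))      # 16 GPUs, as in the 2018 DGX-2
```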

We discussed the history of NVLink and NVSwitch in detail back in March 2023, a year after the "Hopper" H100 GPUs launched and when the DGX H100 SuperPOD systems, which could in theory scale to 256 GPUs in a single GPU shared memory footprint, debuted. Suffice it to say, NVLink and its NVLink Switch fabric have turned out to be as strategic to Nvidia's datacenter business as InfiniBand is and as Ethernet will likely become. And many of the same companies that were behind the Ultra Ethernet Consortium effort to agree on a common set of augmentations for Ethernet to take on InfiniBand are now getting together to form the Ultra Accelerator Link, or UALink, consortium to take on NVLink and NVSwitch and provide a more open shared memory accelerator interconnect that is supported on multiple technologies and is available from multiple vendors.

The kernel of the Ultra Accelerator Link consortium was planted last December, when CPU and GPU maker AMD and PCI-Express switch maker Broadcom said that the xGMI and Infinity Fabric protocols AMD uses to link the memories of its Instinct GPUs to each other, and also to the memories of CPU hosts using the load/store memory semantics of CPU NUMA links, would be supported on future PCI-Express switches from Broadcom. We had heard that it would be a future "Atlas 4" switch adhering to the PCI-Express 7.0 specification, which would be ready for market in 2025. Jas Tremblay, vice president and general manager of the Data Center Solutions Group at Broadcom, confirms that this effort is still underway, but don't jump to the wrong conclusion: do not assume that PCI-Express will be the only UALink transport, or that xGMI will be the only protocol.

AMD is contributing the much broader Infinity Fabric shared memory protocol, as well as the more limited and GPU-specific xGMI, to the UALink effort, and all of the other players are agreeing to use Infinity Fabric as the standard protocol for accelerator interconnects. Sachin Katti, senior vice president and general manager of the Network and Edge Group at Intel, said that the Ultra Accelerator Link "promoter group," which comprises AMD, Broadcom, Cisco Systems, Google, Hewlett Packard Enterprise, Intel, Meta Platforms, and Microsoft, is looking at using the Layer 1 transport level of Ethernet, with Infinity Fabric on top, as a way to glue GPU memories into a giant shared space akin to NUMA on CPUs.
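
Neither the UALink specification nor the Infinity Fabric wire format is public, so the following is only a hypothetical sketch of what load/store memory semantics riding on an Ethernet Layer 1 transport might look like in the abstract: a read or write aimed at memory on another accelerator, packed into a frame rather than pushed through a network software stack. Every field name and size here is an assumption made for illustration.

```python
# Hypothetical sketch only: the real UALink / Infinity Fabric wire format is
# not public, so these field names and sizes are invented for illustration.
# The idea being shown: a load or store targets memory that physically lives
# on another accelerator, and the request rides directly on the link rather
# than going through a network software stack.

from dataclasses import dataclass
from enum import Enum
import struct

class Op(Enum):
    LOAD = 0    # read one 64-byte line from a remote accelerator's memory
    STORE = 1   # write one 64-byte line to a remote accelerator's memory

@dataclass
class MemRequest:
    op: Op
    dest_accel: int       # which accelerator in the pod holds the memory
    address: int          # offset into that accelerator's memory
    payload: bytes = b""  # 64 bytes for a STORE, empty for a LOAD

    def pack(self) -> bytes:
        """Pack into a little-endian header plus payload, as the request
        might be carried inside an Ethernet frame's payload."""
        header = struct.pack("<BHQ", self.op.value, self.dest_accel, self.address)
        return header + self.payload

# A GPU storing a cache line into memory that lives on accelerator 37:
frame = MemRequest(Op.STORE, dest_accel=37, address=0x1000, payload=b"\x00" * 64).pack()
print(len(frame))  # 11-byte header + 64-byte payload = 75 bytes
```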

Here is the concept of creating the UALink GPU and accelerator pods:

And here is how you use Ethernet to link the pods into larger clusters:

No one is expecting to link GPUs from multiple vendors inside one chassis or maybe even one rack or one pod of multiple racks. But what the UALink consortium members do believe is that system makers will create machines that use UALink and allow accelerators from many players to be put into these machines as customers build out their pods. You could have one pod with AMD GPUs, one pod with Intel GPUs, and another pod with some custom accelerators from any number of other players. It allows commonality of server designs at the interconnect level, just like the Open Accelerator Module (OAM) spec put out by Meta Platforms and Microsoft allows commonality of accelerator sockets on system boards.

Wherefore Art Thou CXL?

We know what you are thinking: Were we not already promised this same kind of functionality with the Compute Express Link (CXL) protocol running atop PCI-Express fabrics? Doesn't the CXL.mem subset already offer the sharing of memory between CPUs and GPUs? Yes, it does. But PCI-Express and CXL are much broader transports and protocols. Katti says that the memory domain for pods of AI accelerators is much larger than the memory domains for CPU clusters, which as we know scale from 2 to 4 to sometimes 8 to very rarely 16 compute engines. GPU pods for AI accelerators scale to hundreds of compute engines, and need to scale to thousands, many believe. And unlike CPU NUMA clustering, GPU clusters in general, and those running AI workloads in particular, are more forgiving when it comes to memory latency, Katti tells The Next Platform.
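
To make the NUMA analogy concrete, here is a minimal sketch of one flat address space striped across a pod of accelerators, so that any load or store can be routed to the device that physically holds the memory. The pod size and per-device capacity are assumptions for illustration, not figures from the UALink or CXL specs.

```python
# Minimal sketch of the NUMA-style idea discussed above: a single flat address
# space carved across the accelerators in a pod, so any load or store can be
# routed to the device that physically holds that address. The pod size and
# per-accelerator capacity are assumptions for illustration only.

POD_SIZE = 1024                # UALink 1.0 targets pods of up to 1,024 accelerators
MEM_PER_ACCEL = 192 * 2**30    # assume 192 GB of HBM per accelerator

def route(global_addr: int) -> tuple[int, int]:
    """Map a pod-wide address to (accelerator id, local offset)."""
    accel = global_addr // MEM_PER_ACCEL
    if accel >= POD_SIZE:
        raise ValueError("address falls outside the pod's shared memory")
    return accel, global_addr % MEM_PER_ACCEL

# An address just past the first accelerator's capacity lands on accelerator 1:
print(route(MEM_PER_ACCEL + 4096))  # (1, 4096)
```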

So don’t expect to see UALinks lashing together CPUs, but there is no reason to believe that future CXL links won’t eventually be a standard way for CPUs to share memory – perhaps even across different architectures. (Stranger things have happened.)

This is really about breaking the hold that NVLink has when it comes to memory semantics across interconnect fabrics. For anything Nvidia does with NVLink and NVSwitch, its several competitors need a credible alternative, whether they are selling GPUs, other kinds of accelerators, or whole systems, because prospective customers most definitely want more open and cheaper alternatives to the Nvidia interconnect for AI server nodes and rackscale pods of gear.

"When we look at the needs of AI systems across datacenters, one of the things that's very, very clear is that the AI models continue to grow massively," says Forrest Norrod, general manager of the Data Center Solutions group at AMD. "Everyone can see this means that for the most advanced models, many accelerators need to work together in concert for either inference or training. And being able to scale those accelerators is going to be critically important for driving the efficiency, the performance, and the economics of large scale systems going out into the future. There are several different aspects of scaling out, but one of the things that all of the promoters of Ultra Accelerator Link feel very strongly about is that the industry needs an open standard that can be moved forward very quickly, an open standard that allows multiple companies to add value to the overall ecosystem. And one that allows innovation to proceed at a rapid clip, unfettered by any single company."

That means you, Nvidia. But, to your credit, you invested in InfiniBand and you created NVSwitch with absolutely obese network bandwidth to do NUMA clustering for GPUs. And you did it because PCI-Express switches are still limited in terms of aggregate bandwidth.

Here's the funny bit. The UALink 1.0 specification will be done in the third quarter of this year, and that is also when the Ultra Accelerator Link consortium will be incorporated to hold the intellectual property and drive the UALink standards. That UALink 1.0 specification will provide a means to connect up to 1,024 accelerators into a shared memory pod. In Q4 of this year, a UALink 1.1 update will come out that pushes scale and performance even further. It is not yet clear whether the 1.0 and 1.1 UALink specs will support PCI-Express transports, Ethernet transports, or both.

NVSwitch 3 fabrics using NVLink 4 ports could in theory span up to 256 GPUs in a shared memory pod, but only eight GPUs were supported in commercial products from Nvidia. With NVSwitch 4 and NVLink 5 ports, Nvidia can in theory support a pod spanning up to 576 GPUs, but in practice commercial support is only being offered on machines with up to 72 GPUs, in the GB200 NVL72 rackscale system.

11 Comments

  1. “And one that allows innovation to proceed at a rapid clip unfettered by any single company.”

    That's exactly backwards. Nvidia, a single company motivated to chase a strategic market, is where the innovation happened. This consortium is just trying to catch up. The reason NVLink exists is that the PCI group wasn't innovating and Intel tried to throw Nvidia off its platform as best it could. Then, when CCIX, Gen-Z, and CXL were being developed, Nvidia was saying "we need X" and they were saying "no, we are building Y". Kudos to those other companies for trying to respond, but let's not invert where the innovation, and where the market misses, have come from as part of some PR blurb.

  2. Players are concocting a plan on how to compete among themselves for sockets in the sliver of the ML pie Nvidia doesn’t already own. It’s going to be a slow process and fraught with technical challenges while in the background players will jockey for technical advantage. Not a roadmap to success.

  3. I apologize in advance for the length and the possibly pedantic nature of this post, but it was a response to a post at Phoronix for a poster who was a bit confused about the meaning of this news release. Make some tea and see if it has any merit. Thx.

    CXL is the heterogeneous memory protocol for connecting CPUs to CPUs, CPUs to GPUs, and CPUs to accelerators that are not GPUs, like FPGAs and DSPs. It now contains the work of other heterogeneous protocols such as Gen-Z for heterogeneous communication from rack to rack, CCIX formerly of ARM, and OpenCAPI from IBM. Intel spearheaded CXL before all the other consortiums gave over their protocols to the CXL group. You can look at CXL as the logical extension and evolution of PCIe.

    You can also look at UALink as the hardware or physical layer to the UXL Foundation's software framework. UXL is the foundation set up to promote a unified accelerator framework leveraging Intel's OneAPI as the alternative to Nvidia's CUDA. Intel also heads this one up, along with Google, ARM, Samsung, Imagination, Qualcomm, Fujitsu, and VMware, which by the way is now owned by Broadcom, who is also part of UALink. UXL can also be looked at as the successor to AMD's HSA Foundation, which was killed off by Lisa Su after she arrived at AMD and killed off the Fusion project. UXL, with the exception of AMD and Google, has pretty much all the other companies that were at one time involved with HSA, and for the same reasons, that being to have a unified software framework for heterogeneous accelerators, while UALink in addition is now the physical side of the former HSA.

    So taking all this into account, it becomes clear, to me at least, that the Compute Industrial Complex that's not Nvidia has chosen Intel's heterogeneous hardware links (CXL) for everything but GPU to GPU communication, and has chosen Intel's OneAPI as its alternative to CUDA, as shown with the UXL Foundation being the successor to AMD's HSA. And all AMD gets for its pioneering work with HyperTransport, and later Fusion and HSA, is a pat on the head and the inclusion of its Infinity Fabric for nothing more than heterogeneous GPU to GPU communications.

    I mean, that is important and necessary in this age of AI and Global Compute. And it is also an acknowledgment that Infinity Fabric is superior to other protocols for this purpose. But AMD also needs to acknowledge that once again the industry has rejected its work concerning 3DNow and HSA and ROCm. Lisa Su needs to take a VERY hard look at the Compute Industrial Complex and realize that the only way to compete against Nvidia is to have superior hardware and, more importantly, superior AND UNIFIED software. AMD makes superior hardware to Intel…full stop. AMD makes shitty software…full stop. AMD needs to rally around Intel's CXL and OneAPI, leverage the industry support of Infinity Fabric for GPU communication, CEASE development of ROCm, and pour those freed up resources and money into making hardware that runs Intel's physical and software frameworks better and cheaper than Intel itself. THAT'S…how you compete against Nvidia. Nvidia is the Apple Computer of the Compute Industrial Complex. They make both superior hardware and software in house. The x86 world is the Android version of the Compute Industrial Complex. But by rallying around Intel's CXL and Intel's UALink (leveraging AMD's Infinity Fabric) and Intel's OneAPI through the UXL Foundation (the successor to AMD's HSA), the x86 world might have a chance to competently compete against Nvidia by the end of the decade.

  4. @EC, not to worry, fixed for ya:
    "AMD has concocted a plan on how to -avoid- competition from players like Intel to actively damage its solution of competing for the sliver of the ML pie Nvidia doesn't already own."

    Basically, AMD and Broadcom are those with THE tech. The rest are along for the ride, at this moment. This is more like when Microsoft, basically, informed Intel that they are going with AMD64 and they better behave or be left behind …

  5. Do you know why they're not using Ethernet for scale-out and scale-up? I can see why PCIe is not used for scale-up (i.e., connecting GPUs inside a node). But Ethernet switches are a lot faster than PCIe switches. It's puzzling why they would develop UALink instead of just using Ethernet. That way they could leverage Ultra Ethernet, which is used for scale-out.

    • It's about load/store, and one implementation of it will run atop Ethernet. Lots more bandwidth in those Broadcom switch ASICs than in the PCI-Express ASICs, at least for now. Perhaps they will accelerate the PCI-Express roadmap; Jas has wanted to do that and skip PCI-Express 6.0 and go right to PCI-Express 7.0, but none of the server and GPU makers were ready.

      • Thanks. It's my wildly generalized take on what I've seen happen in the industry over the last 20 years or so, and based on a decidedly layman's viewpoint. I would also like to offer a clarification on my point of UALink being "Intel's UALink". Of course Infinity Fabric is AMD's interconnect technology, and a damn fine one at that. This is why it was chosen for the underpinnings of UALink. But let's be honest. Without Intel's name on the banner page and its support thrown behind the standard, UALink would not have the same traction that it will now have. Yes, I know Cisco is there, and HP, and Microsoft. But AMD has had those endorsements before for other tech frameworks, only to see initial support fade away once Intel pulled their head out of their nether regions and came up with something else. With Intel being in the ranks of support for UALink, along with it being the leader of CXL and UXL and an original member of Ultra Ethernet, I think we're finally seeing a unified approach to compete with Nvidia. I am actually glad to see a great technology developed by AMD as part of this mix.

      • Yes exactly… those Broadcom Ethernet switches are very fast, at 51.2 Tb/sec. That's why I don't understand why we need yet another standard with UALink instead of just modifying the existing Ethernet standard to accommodate this need.

  6. Moore’s Law is coming closer than you think.
    One solution here: polymers from LWLG, mark my words!!!
