Ethernet Consortium Shoots For 1 Million Node Clusters That Beat InfiniBand

Here we go again. Some big hyperscalers and cloud builders and their ASIC and switch suppliers are unhappy about Ethernet, and rather than wait for the IEEE to address the issues, they are taking matters into their own hands to create what will ultimately become an IEEE standard – one that moves Ethernet forward in a direction and at a speed of their choosing.

This time around – and this is no surprise to readers of The Next Platform – the target is InfiniBand, the low latency, high bandwidth interconnect that is at this point essentially controlled by Nvidia thanks to the closing of its $6.9 billion acquisition of Mellanox Technologies in April 2020.

Supercomputer maker Cray was the first to try to create a variant of Ethernet that could take on InfiniBand in the HPC arena with its “Rosetta” ASIC, the foundation of its Slingshot interconnect, which we drilled down into back in August 2019. The Slingshot interconnect has displaced InfiniBand or its Omni-Path offshoot (formerly controlled by Intel but now owned by former QLogic executives who founded Cornelis Networks) in the big exascale and pre-exascale machines being built by Hewlett Packard Enterprise in the United States and Europe. HPE bought Cray in May 2019 for $1.3 billion, in large measure due to HPE’s desire to broadly commercialize the Slingshot interconnect for both HPC and AI workloads. Like any new interconnect, Slingshot has had growing pains, but it is working, and at a scale that the world has never seen before.

In a funny aside, Google created its “Aquila” GNet protocol, abandoning Ethernet to outdo InfiniBand, which we covered back in April 2022 and which the company has been testing with its converged NIC/switch architecture. For the past seven years, Google has deployed its homegrown “Jupiter” and “Apollo” backbone switches, which it also uses to interconnect the nodes in its TPUv4 pods, with 4,096 devices interlinked. Google is clearly doing its own thing when it comes to datacenter networking, even if it does still buy a lot of Ethernet and InfiniBand switching. Google has been making its own Ethernet switches based on merchant silicon from various vendors (mostly Broadcom) since 2004, but Aquila and Apollo are different in that Google is doing the chippery, not just the switch or router design using merchant silicon.

All kinds of luminaries in the datacenter want Ethernet to have the benefits of InfiniBand without sacrificing compatibility with Ethernet, and in February this year they published a paper, Datacenter Ethernet And RDMA: Issues At Hyperscale, that basically crabbed about the RDMA over Converged Ethernet (RoCE) protocol. RoCE has issues running at scale: it is not as good as the Remote Direct Memory Access implementation that gives InfiniBand its low latency, and it needs to be improved upon to run at the kind of scale that hyperscalers and cloud builders need for AI and HPC systems.

The issues raised in the paper – and there are many more than just RoCE – are well known to Nvidia and to the Ethernet ASIC makers, and the paper didn’t force a change in their roadmaps as much as it set the stage for switch ASIC announcements from Broadcom and Cisco Systems that were already well underway and a predictable response from Nvidia with its Spectrum-4 Ethernet switches.

Ultra Ethernet Work Already In Progress

Broadcom jumped out first in April with its InfiniBand killer, the Jericho3-AI chip, showing how it could scale to 32,000 interconnected GPUs in a single network, with better load balancing by spraying data across all open links on the fabric and much better congestion control mechanisms to provide more deterministic performance on the “elephant flows” that are typical in AI training – something that normal Ethernet cannot do well.

And in June, Cisco, which has aspirations as a merchant silicon provider for the hyperscalers and cloud builders, started gunning for InfiniBand with its Silicon One G200 and G202 switch ASICs. Cisco was showing off how it could interconnect 32,768 GPUs in a two-level network with 40 percent fewer switches and 50 percent fewer optical transceivers and cables than other Ethernet alternatives (we assume Broadcom’s “Tomahawk 5” and Nvidia’s Spectrum-4 switches) with ASICs running at the same 51.2 Tb/sec of aggregate bandwidth.
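
To see why switch count falls so quickly as radix rises, here is a back-of-envelope sizing model for a non-blocking, two-level leaf/spine fabric, written as a short Python sketch. The port counts (a hypothetical 512-port versus 256-port carve-up of a 51.2 Tb/sec ASIC) and the half-down, half-up leaf split are our own simplifying assumptions for illustration – this is not Cisco’s arithmetic, and it ignores the different per-GPU port speeds those radices imply.

    import math

    def two_tier_switch_count(endpoints: int, radix: int) -> int:
        """Switches needed for a non-blocking, two-level leaf/spine fabric.

        Assumes each leaf devotes half its ports to endpoints and half to
        uplinks, and that every spine port faces a leaf.
        """
        down_per_leaf = radix // 2
        leaves = math.ceil(endpoints / down_per_leaf)
        if leaves > radix:
            raise ValueError("too many endpoints for two levels at this radix")
        # Spine downlinks must absorb all of the leaf uplinks.
        spines = math.ceil(leaves * (radix // 2) / radix)
        return leaves + spines

    # Hypothetical port counts for a 51.2 Tb/sec ASIC: 512 x 100G vs 256 x 200G.
    for radix in (512, 256):
        print(radix, "ports:", two_tier_switch_count(32_768, radix), "switches")

In this toy model, the higher radix build needs roughly half as many boxes for the same 32,768 endpoints, which is the general effect Cisco is pointing at, even if its 40 percent figure comes from its own, more detailed comparison.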

All three of these Ethernet switch ASIC makers – HPE, Broadcom, and Cisco, two of whom make their own switches – have attacked different parts of the Ethernet hardware and protocol stack to try to improve it for AI and to make it better able to compete with InfiniBand. (Google is not part of the effort – yet.) And now they are ganging up with each other and with switch upstart Arista Networks, two of the biggest hyperscalers and cloud builders – Microsoft and Meta Platforms – and CPU and DPU makers AMD and Intel (yes, it is weird to put AMD first, but it is clearly ahead of Intel) to create the Ultra Ethernet Consortium, which will drive a new standard for high performance, low latency, crazy scale Ethernet fabric.

The HPC server business of Atos, now known as Eviden, is also a founding member of the Ultra Ethernet Consortium, which is interesting in that it sells the Bull Exascale Interconnect (BXI), a commercialized version of the Portals protocol that has been under development at Sandia National Laboratories for the past three decades. BXI has been able to stand toe-to-toe with InfiniBand, and the expectation is that the 400 Gb/sec Omni-Path follow-on from Cornelis will be able to do so as well. We originally conjectured that Bull was joining the consortium because it might not want to do BXI anymore, but Eviden reached out and said that BXI is central to its strategy.

“We have taken the decision to move from a custom protocol to an evolution of the Ethernet protocol well before the UEC started,” Eric Eppe, group vice president of HPC/AI/Quantum portfolio and strategy for Eviden at Atos Group, tells The Next Platform. “We joined the UEC as a founding member on the premises that we will benefit, like the other members, from a much wider ecosystem while retaining strong IPs and differentiation on the AI and HPC markets.”

Cornelis definitely wants to make its own switch platform and probably will not join, but it could take the same approach that Eviden has.

All of this is probably giving you flashbacks to July 2014, when hyperscalers and cloud builders Google and Microsoft started the 25 Gigabit Ethernet Consortium and were joined by Broadcom, Mellanox, and Arista Networks to bring the 25 Gb/sec signaling used in routers down into Ethernet switches, which, per IEEE standards, were stuck using slower and hotter 10 Gb/sec signaling to make 100 Gb/sec ports. That was a no-go for the hyperscalers and cloud builders, and it caused Broadcom to create a whole new Tomahawk family of ASICs and Mellanox to do the same with its Spectrum line. And guess what? The hyperscalers and cloud builders were right, and eventually the IEEE had to endorse a new standard that it had originally rejected.

We know who is the dog and who is the tail in the IT market now, right? What was true in 2014 is even more true in 2023.

Fixing Ethernet For Hyperscale

The Ultra Ethernet Consortium is being hosted at the Linux Foundation, which is about as neutral as you can get in this world, and the founding companies are donating intellectual property and personnel to create a unified Ultra Ethernet standard that they can all eventually hew to with their future products. You can read all of the background on the Ultra Ethernet effort in this position paper, but it all boils down to this: InfiniBand is essentially controlled by a single vendor, and the hyperscalers and cloud builders hate that, and it is not Ethernet, and they hate that, too. They want one protocol with many options in terms of functionality, scale, and price.

One of the key features of the emerging Ultra Ethernet standard is the packet spraying technique for multipathing and congestion avoidance that Broadcom and Cisco have in their respective Jericho3-AI and G200 ASICs. They also want to add flexible packet ordering to the Ethernet standard, which helps the All-Reduce and All-to-All collective operations commonly used in AI and HPC applications run better than they can when strict packet ordering is enforced.
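
To make the ordering point concrete, here is a toy Python sketch of packet spraying: the packets of a single flow are dealt round-robin across several equal-cost links with different, made-up delays, so they arrive out of order, and the receiver reassembles them by sequence number instead of stalling for in-order delivery. This is only a conceptual illustration under our own assumptions, not how Jericho3-AI, the G200, or the eventual Ultra Ethernet specification implements it.

    import random

    def spray_and_reassemble(num_packets: int = 16, num_links: int = 4) -> None:
        """Spray one flow across links, then reassemble by sequence number."""
        random.seed(7)
        # Each link gets a different made-up base delay plus some jitter.
        base_delay = [1.0 + 0.3 * link for link in range(num_links)]
        arrivals = []
        for seq in range(num_packets):
            link = seq % num_links                        # round-robin "spray"
            arrivals.append((base_delay[link] + random.uniform(0.0, 0.5), seq))
        arrival_order = [seq for _, seq in sorted(arrivals)]
        print("arrival order:", arrival_order)            # out of order across links
        # With relaxed ordering the receiver places each packet by sequence
        # number as it lands instead of waiting for the next in-order packet.
        print("reassembled:  ", sorted(arrival_order))

    spray_and_reassemble()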

The Ultra Ethernet standard will also address new congestion control methods that are optimized for AI and HPC workloads (and far less brittle than methods that have been developed for Ethernet fabrics supporting web and database applications running at scale). This congestion control requires end-to-end fabric telemetry, which many switch ASIC makers and switch makers have been trying to graft onto existing ASICs. They want it built in and standardized, but with enough room for vendors to create their own implementations for differentiation.
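
As a rough sketch of what telemetry-driven congestion control means in practice, the loop below has a sender adjust its injection rate from queue-depth readings reported back by the fabric: multiplicative backoff when a queue crosses a threshold, additive probing toward line rate otherwise. The threshold, the gains, and the use of raw queue depth as the signal are all placeholder assumptions of ours, not something the consortium has specified.

    def adjust_rate(rate_gbps: float, queue_depth: int,
                    threshold: int = 1_000, line_rate: float = 400.0) -> float:
        """One control step driven by a fabric telemetry sample (illustrative values)."""
        if queue_depth > threshold:
            # Multiplicative decrease, scaled by how far past the threshold we are.
            overload = min(queue_depth / threshold, 2.0)
            return max(rate_gbps / overload, 1.0)
        # Additive increase toward line rate when the fabric reports headroom.
        return min(rate_gbps + 10.0, line_rate)

    rate = 400.0
    for depth in (200, 300, 2_500, 1_800, 400, 100):      # made-up telemetry samples
        rate = adjust_rate(rate, depth)
        print(f"queue depth {depth:5d} -> send at {rate:6.1f} Gb/sec")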

They also want a new implementation of RDMA that is more efficient and more scalable than either InfiniBand or Ethernet with RoCE. “While large lossless RoCE networks can and have been successfully deployed, they require careful tuning, operation, and monitoring to perform well without triggering these effects,” the consortium members write. “This level of investment and expertise is not available to all network operators and leads to a high TCO. A transport protocol that does not depend on a lossless fabric is needed.”
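
One way to read that last sentence is that the transport should tolerate drops and recover with selective retransmission, rather than leaning on priority flow control to make the fabric lossless the way RoCE deployments typically do. The sketch below contrasts selective recovery with the go-back-N style recovery that classic RoCE reliable connections fall back to on a loss; the sequence numbers and acknowledgment bookkeeping are purely illustrative.

    def selective_resend(packets_sent: int, acked: set[int]) -> list[int]:
        """Resend only the sequence numbers the receiver has not acknowledged."""
        return [seq for seq in range(packets_sent) if seq not in acked]

    def go_back_n_resend(packets_sent: int, acked: set[int]) -> list[int]:
        """Resend everything from the first hole onward (classic RoCE-style recovery)."""
        first_hole = min(seq for seq in range(packets_sent) if seq not in acked)
        return list(range(first_hole, packets_sent))

    # Packets 0 through 9 went out; the fabric dropped 3 and 7 under congestion.
    acked = set(range(10)) - {3, 7}
    print("selective:", selective_resend(10, acked))      # [3, 7]
    print("go-back-N:", go_back_n_resend(10, acked))      # [3, 4, 5, 6, 7, 8, 9]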

They add that the Verbs API that underpins InfiniBand RDMA and Ethernet RoCE is designed for lower bandwidths and a smaller number of peers on a network, and that the Reliable Connection transport mode in these two protocols cannot keep up with the speed of current and future networks. And finally, they say they want wire rate performance on 800 Gb/sec, 1.6 Tb/sec, and faster Ethernet networks and the ability to scale to 1 million endpoints in a single network. And if history is any guide, they are going to get it.
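
The Reliable Connection complaint is easy to put rough numbers on: RC keeps per-peer connection state – in the simplest case, a queue pair per remote peer – so the state an endpoint carries grows linearly with the number of peers it talks to. The per-queue-pair footprint in the Python sketch below is an assumed, illustrative figure, and real numbers vary by NIC, queue depth, and configuration, but the shape of the curve is the point.

    def rc_qp_state_mb(peers: int, bytes_per_qp: int = 384) -> float:
        """Rough megabytes of queue pair state one endpoint holds, assuming
        one Reliable Connection queue pair per remote peer (illustrative footprint)."""
        return peers * bytes_per_qp / 1e6

    for peers in (1_000, 32_000, 1_000_000):
        print(f"{peers:>9,} peers -> ~{rc_qp_state_mb(peers):7.1f} MB of queue pair state per endpoint")

At a million peers, even a few hundred bytes per connection lands in the hundreds of megabytes per NIC before send and receive queues are counted, which is one illustration of why the consortium says Reliable Connection cannot keep up.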

Your move, Nvidia.

14 Comments

  1. During the last 20 years, the web has had infinite articles about why InfiniBand is dead or going to die.
    Reddit, 2008: Is InfiniBand dead?
    Computerworld, 2012: What killed InfiniBand?

    But InfiniBand is alive.
    ChatGPT’s GPU interconnect uses InfiniBand.
    ASIC providers, I have a question: why do you want to fight against InfiniBand if it is going to die?

    • You don’t fight against things because they are dying, but because they are thriving. This is an eternal battle between the two. May it continue long and drive innovation far.

      • That’s why we read TNP — it is unlike those other “news” outfits. Here, InfiniBand is not just alive and well, but everyone is out to beat it (the analysis is more thorough):

        CISCO Guns for Infiniband (2023)
        Broadcom Takes on Infiniband (2023)
        Google abandons ethernet to outdo infiniband (2022)
        Infiniband, still setting the pace for HPC (2020)

        etc…

  2. Linux community, open technology, advanced development and the rethinking of great ideas, top engineering, extra effort and know-how driven by competition, etc. — it all sounds like something from the Age of Aquarius.

  3. Single-source is a critical flaw, especially now that it’s part of a vertically-integrated company. And to be frank, what has IB done for us lately? Sure, phy designers have provided the means for IB to step up bandwidth.

    There have been a lot of “X challenges IB” stories, but the fact is that IB remains quite niche. That is, those challengers have managed to limit IB’s TAM and thus overall importance.

    I wonder if everyone would agree that IB’s choice to ignore eth compatibility was a historic mistake.

    • I think that the 2019 Slingshot article (linked in the 3rd paragraph) nicely supports your point — essentially showing that an eth-oriented approach can sustain performant RDMA/RoCE and tail-cutting congestion management. But it did take time and effort to develop and demonstrate, and so, beforehand, niche had to be the law of the land as it were.

      Meanwhile, in the “Look mom, no InfiniBand” El Reg article of 05/29/23 we read that: “at COMPUTEX Huang announced the SPECTRUM-4 […] switch that marries Ethernet and InfiniBand, with a 400GB/s BlueField 3 SmartNIC” (also in TNP, e.g. searching for BlueField) — and so there’s competitive impetus at nV as well to evolve IB in interesting ways.

    • No, Spectrum-4 is Ethernet. Not InfiniBand. Mellanox has not converged its ASICs since the SwitchX days back in 2011 – an approach it reversed because, it said, it was a bad idea that increased latency for InfiniBand.

    • Different animals for sure. Maybe we have networks that are being asked to do wildly divergent things, and AI and HPC can’t piggyback on the same interconnect?
