Eridu Cuts To The AI Networking Chase With High Radix Switch System
Sometimes, innovation in IT comes out of the hyperscalers and clouds. But long before these companies rose out of the ashes of the Dot Com bust, innovation often came out of the telecom and service provider sectors and was brought over to the data processing side.
So it is with a networking startup called Eridu, which has just dropped out of stealth mode after raising over $200 million in Series A funding to create a very high radix switch system for AI clusters that will flatten networks and therefore drop latency and cut network costs, which account for somewhere north of 20 percent of the total cost of acquisition for AI cluster hardware.
Hyperscalers and cloud builders, not to mention HPC centers, do not like it when networking represents more than 10 percent of the cost of a distributed computing system, and they like it even less when they see the numbers drifting up towards 30 percent once you take into account the scale up network to lash together GPU and XPU memories into rackscale and someday rowscale systems as well as the scale out network to glue thousands or tens of thousands of rackscale (and again, soon rowscale) systems into a more loosely coupled system using some variant of the Message Passing Interface (MPI) protocol developed for HPC supercomputers many decades ago.
If you want to do a full accounting of the network costs for AI hardware, companies also have to beef up the front end networks linking applications to AI clusters, because these are generally painfully slow compared to the scale out networks that comprise the outer rings or branches (depending on whether you use toroid, mesh, or fat tree topologies) of the AI clusters.
For now, Eridu is keeping its cards pretty close to its vest on exactly what its plans are and how it is going to make a high radix AI switching system, and as we pummeled the founders with statements about how we needed an Ethernet switch with zillions of high speed ports with super-low latency and crazy high bandwidth, all they did was nod their heads and laugh.
This is, of course, something we have railed about for a long time, particularly as we saw PCI-Express switches for interlinking GPUs not have enough ports (which is called low radix in networkspeak) as well as not enough bandwidth per lane (lanes are ganged up to make ports), and therefore not enough aggregate bandwidth to build rackscale – never mind rowscale – systems. (If the rack is the new server, then the row is the new rack, right?)
And while the feeds and speeds of the UALink interconnect specification can stand toe-to-toe with what Nvidia has done brilliantly with its NVSwitch memory fabric for its GPUs, we have seen no assurances that someone will build a UALink switch ASIC that will push the envelope on that spec and meet or even beat Nvidia at the memory fabric game. In theory, UALink switches will be able to support 1,024 GPUs or XPUs in a shared memory domain with a single level of networking, which is a heck of a lot more than the 72 GPUs in the current NVSwitch 3 and the NVLink 5 ports on Nvidia “Blackwell” GPUs. To be fair, Nvidia can, in theory, support up to 288 GPU sockets in a multi-tier NVSwitch 3 fabric, but it is not commercially supported and it adds latency to a substantial number of the memory fabric hops between GPUs.
Before digging into what little we know about Eridu, some background on its three founders is important because it shows how successful they have been in the telecom and service provider sectors, going all the way back to the birth of the Internet. Past performance is no guarantee of future performance, of course, but it is a pretty good predictor of how new ideas can come into a market with a chance of being commercialized.
Eridu co-founder Drew Perkins is a serial entrepreneur who has been at this for 45 years. Perkins got his bachelor’s degree in electrical engineering, computer engineering, and mathematics from Carnegie Mellon University in 1986, and worked at CMU on various projects while pursuing his degree and after graduating, including work on the IP protocol and the creation of the Point-to-Point Protocol that eventually was embedded in so many Internet applications. In 1987, he founded a company called Entelechy and designed what he thinks is the first multi-port Ethernet switch, but it was not successfully commercialized.
Perkins was a co-founder of InterStream in 1990, which was an early provider of network-attached storage, and things got really challenging when he was literally hit by a truck driven by a drunk driver and was hospitalized and in recovery for more than two years. After teaching himself how to walk again and having more metal parts than the Bionic Man, Perkins joined Fore Systems, an ATM backbone supplier for the early commercial Internet, as principal architect, writing its ATM switch software stack. In 1998, Perkins was co-founder and chief technology officer of Lightera Networks, which created the CoreDirector optical switch that compelled Ciena to buy the company for $550 million a little more than a year later. Perkins had the same titles at OnFiber Communications, which provided Internet fiber to metro areas in the United States and which was acquired by Qwest in May 2001. Perkins jumped immediately to be co-founder and CTO at Infinera, which he says developed the world’s first silicon photonics ICs. Infinera went public in 2007 and was acquired by Nokia in June 2024 for $2.3 billion.
In true Steve Austin fashion and a bit eerily similar to Minority Report and Black Mirror, after getting intraocular lenses because of cataracts in both eyes in 2014 during a sabbatical from work, Perkins founded Mojo Vision to give everyone bionic eyes by doing image enhancement in “smart contacts” and projecting those enhanced images directly onto the retina. Perkins stepped down as CEO at Mojo Vision in March 2023 to start Eridu.
The two other co-founders at the networking upstart are Omar Hassen, the company’s chief product officer, and Mike Capuano, chief business development and marketing officer. Hassen was most recently senior vice president of business development at Ventana Micro Systems, a maker of RISC-V server chips. Prior to that, he was in charge of product development at Arm server chip upstart AppliedMicro, whose Arm server CPU business laid the foundation for Ampere Computing (now part of SoftBank) and whose optical networking and transceiver chip business went to Macom. Hassen has deep experience managing divisions and groups at International Rectifier, Marvell, and Broadcom going back to the end of the Dot Com boom.
Capuano has been on the marketing and sales side of tech companies, starting at Motorola in 1992, ending up at Juniper Networks a few years later after an acquisition, and then working for Cisco Systems and Infinera for more than a decade before doing a stint at Pluribus Networks, a maker of server-switch half-bloods that was acquired by Arista Networks in 2022.
So there is the pedigree. Now, let’s count the money.
Perkins tells The Next Platform that the company raised “well in excess of $200 million so far.” The chatter on the Intertubes is that the seed round was around $30 million and the Series A round was more than $200 million. This is a fairly substantial amount of money, which is what you need if you want to take on Nvidia NVSwitch and Spectrum-X on its home AI turf and if you are also competing against Broadcom, Cisco Systems, Marvell, and a handful of others who are building ASICs for scale up and scale out networks for AI systems.
The Series A funding was led by Socratic Partners, John Doerr (famous for being the tech guy at Kleiner Perkins and investing early in Compaq, Netscape, Sun Microsystems, Google, and Amazon), Hudson River Trading, Capricorn Investment Group, and Matter Venture Partners. SBVA, Bosch Ventures, TDK Ventures, Eclipse Capital, and VentureTech Alliance also added some money to the Eridu war chest.
With anywhere from $3 trillion to $5 trillion in AI hardware spending between now and the end of the decade, depending on who you ask and what you believe is either possible or likely, there is a lot of networking money up for grabs. Like anywhere from $600 billion to $1.5 trillion, depending on where you want to draw the lines. And that is enough revenue to engender innovation and to drive competition and still leave Nvidia the dominant supplier of compute and networking for AI systems in 2030.
As for exactly how this high radix switch will be made, Perkins and Hassen are all smiles and short on details. They offered this chart:
We presume the comparison above pits current 51.2 Tb/sec Ethernet switch ASICs – not the 102.4 Tb/sec devices that have been revealed and that will start appearing in products maybe later this year or early next year – against what Eridu is cooking up. But it will be much more fun if the comparison is against 102.4 Tb/sec devices.
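To put those aggregate bandwidth numbers into radix terms, here is a bit of back-of-the-envelope Python. The 800 Gb/sec and 1.6 Tb/sec port speeds are our assumptions about how the ports might be carved up, not anything Eridu has disclosed:

```python
# Back-of-the-envelope radix math for merchant Ethernet switch ASICs.
# Port speeds are illustrative assumptions, not Eridu disclosures.

def radix(asic_tbps: float, port_gbps: int) -> int:
    """Number of ports (the radix) an ASIC of a given aggregate bandwidth yields."""
    return int(asic_tbps * 1_000 // port_gbps)

print(radix(51.2, 800))     # 64 ports on today's 51.2 Tb/sec silicon
print(radix(102.4, 800))    # 128 ports on the coming 102.4 Tb/sec silicon
print(radix(102.4, 1600))   # back to 64 ports if port speeds double to 1.6 Tb/sec
```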
“As you well know, GPU sockets are expanding in power and performance 10X at a jump, but networking has not,” Perkins explains, and that order of magnitude includes every trick in the book, including lower precision data and compute, tuning of the software stack, and the underlying hardware itself. “Networking is not keeping up, and has only been doubling in performance every silicon node, which is every two and a half to three years. And we don’t think that is fast enough, and Eridu was founded to solve that problem.”
"We want to bring order of magnitude improvement to networking performance because you have to interconnect these GPUs," Perkins continues. "One GPU is great, but when you get into the hundreds of thousands or millions of them, you really have the compute power to have an artificial super intelligence. But the network has to be fast enough to keep up with these things. Otherwise, the GPUs sit idle. And the GPUs are idle two thirds of the time because they don't have a network fast enough to get data to them quickly enough.”
Psst. Don’t tell Nvidia or AMD GPU customers that . . . .
We don’t know what Eridu is precisely up to, but we know a few things.
First, Eridu is the oldest city in the Sumerian Empire in what is now southern Iraq, and it was founded about 7,500 years ago and flourished for many, many centuries. It was close to the Persian Gulf to the south and the Euphrates River to the north, just 15 miles downriver from Ur. Eridu was the city of Enki, the god of wisdom, craftsmanship, and fresh water, and his temple was Abzu, which apparently means “house of the deep waters.” What we might call in America “Stillwater” as in “still waters run deep.” Which leads us all the way back to intelligence, in a funny way. To extend the metaphor even more, Eridu existed before the Great Flood – yes, that one that Mesopotamian and Christian scripture share – and perhaps we exist before a great deluge of our own of artificial intelligence. Deep waters, indeed.
The chart above is unequivocal in that Eridu, the company, wants to have a new switch ASIC architecture and a new switch system design that allows for thousands of GPUs and XPUs to be interconnected in a scale up network and for millions of these compute engines to be interconnected in a scale out configuration of that same new switch ASIC and system. Whatever Eridu is doing is going to be pricey, but so is having heat-sensitive optical transceivers linking Ethernet spine and leaf switches (or NVSwitches driving copper cables or copper backplanes directly).
Let’s do a thought experiment.
If you assume that the switches in the Clos fabric above have 64 ports running at 800 Gb/sec, then the leaf/spine network can present 1,280 ports to interconnect 1,280 GPUs. So there are a few ways to get 1,280 ports out of what looks like a single box. The first is to build a modular switch with its own backplane and fabric interconnects. But this is costly, which is why we see hyperscalers and cloud builders using plain vanilla switches in leaf/spine setups in the first place.
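For the curious, here is a minimal sketch of that leaf/spine arithmetic in Python, assuming a non-blocking two-tier fabric in which each leaf splits its 64 ports evenly between GPU-facing downlinks and spine-facing uplinks; the 40-leaf configuration is our guess at how the 1,280-port figure falls out of the chart, not something Eridu has confirmed:

```python
# Minimal sketch of two-tier (leaf/spine) Clos sizing, assuming each leaf
# splits its radix evenly between downlinks and uplinks (non-blocking).
# The 40-leaf case is an assumption to match the 1,280-port chart figure.

def two_tier_endpoints(radix: int, leaves: int) -> int:
    """Endpoints served by a given number of leaf switches of a given radix."""
    downlinks_per_leaf = radix // 2     # half the ports face GPUs, half face spines
    return leaves * downlinks_per_leaf

def two_tier_max(radix: int) -> int:
    """Upper bound: each spine burns one port per leaf, so leaves <= radix."""
    return two_tier_endpoints(radix, radix)

print(two_tier_endpoints(64, 40))   # 1,280 GPUs with 40 leaves of 64 ports each
print(two_tier_max(64))             # 2,048 GPUs at the non-blocking limit
```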
You could make a massive switch socket, just like we are going to be making massive GPU and XPU sockets with four or more interconnected compute engines starting next year. The trouble with this “big socket” approach when it comes to networking is that you have a beachfront-to-area problem. As you increase the packet processing engine area of the switch, the ratio of beachfront – the edges where you can park SerDes to talk to the outside world – to packet processing area goes down. Moreover, at a certain point, the cut-through latency across the switch socket gets to be too big. So this sucks.
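To make that beachfront-to-area squeeze concrete, here is a rough illustration with made-up die sizes; the edge lengths below are ours for the sake of the arithmetic, not anything Eridu or anyone else has proposed:

```python
# Rough illustration of the beachfront-to-area problem with made-up numbers:
# perimeter (where SerDes live) grows linearly with the edge of a square
# socket, while packet processing area grows with its square.

for edge_mm in (20, 40, 80):                # hypothetical socket edge lengths
    area_mm2 = edge_mm ** 2                 # packet processing area
    beachfront_mm = 4 * edge_mm             # edge length available for SerDes
    print(f"{edge_mm} mm edge: {beachfront_mm / area_mm2:.3f} mm of beachfront per mm^2")

# Output: 0.200, 0.100, 0.050 -- every doubling of the edge halves the I/O
# available per unit of packet processing logic.
```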
But after my rant about this, Hassen says that is not all that sucks in this reticle-limited chiplet mashup approach.
“That is the second order issue,” he points out. “The first order issue is that if you glue switch chips together, now the packet has to traverse multiple places and the packets have to get stored. So you really have to start with a clean sheet of paper and admit that you are not going to glue things together, that you are going to have a certain domain for the switching architecture that you build.”
When we noodle this, we come up with a waferscale switch ASIC, which neither Perkins nor Hassen would confirm or deny. But this makes sense despite the area-to-beachfront issue I raised above. Using small chiplets for anything increases yields like crazy, but it also increases power consumption because of the chip-to-chip interconnects. If you wanted to go completely for it, you might have stacked SerDes around the edges of the waferscale packet processing engines to make up for the reduction in beachfront. Or you might have very large SRAM caches, which are useful for managing flows in complex and large systems, to keep the packet processing capacity and SerDes bandwidth in balance.
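The yield side of that tradeoff is easy to illustrate with the simple Poisson die-yield model; the defect density and die areas below are illustrative assumptions, not foundry data or anything Eridu has shared:

```python
# Why "small chiplets increase yields like crazy": a minimal sketch using the
# Poisson die-yield model, Y = exp(-D * A), with an assumed defect density.
# The numbers are illustrative, not Eridu or foundry disclosures.
import math

def die_yield(area_cm2: float, defects_per_cm2: float = 0.1) -> float:
    return math.exp(-defects_per_cm2 * area_cm2)

monolithic = die_yield(8.0)                   # one reticle-limited ~800 mm^2 die
chiplet    = die_yield(1.0)                   # one ~100 mm^2 chiplet
print(f"monolithic yield: {monolithic:.2f}")  # ~0.45
print(f"chiplet yield:    {chiplet:.2f}")     # ~0.90 per chiplet

# The catch noted above: stitching chiplets back together burns power in the
# die-to-die links and adds hops where packets have to be stored and forwarded.
```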
That’s what I might try. We will have to wait until later this year – maybe at Hot Chips, maybe at some other event – to find out because Eridu is not talking about it right now.
The company did confirm that whatever it does, it will be Ethernet and it will run the appropriate protocols for memory fabrics and scale out networks. And after talking at length with Perkins and Hassen, we did not get the impression that it was an optical circuit switch like Google uses for its network backbone, including for its most recent four generations of TPU clusters.
Here is another hint that I got:
The single domain of the first Eridu rowscale switching system looks like it will support up to 5,120 GPUs or XPUs in a single-hop network, and will support over 1 million compute engines in a two-tier network. It is not clear if the memory atomics will work at that scale, but if it is useful for AI model makers, one can always hope.
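As a sanity check on that claim, here is the two-tier arithmetic under our own assumption – not an Eridu disclosure – that the same 5,120-port single-hop domain can instead split its ports evenly between compute engines and a second switching tier:

```python
# Hedged sanity check on the "over 1 million compute engines" claim, assuming
# the 5,120-port single-hop domain splits its ports evenly between endpoints
# and a second tier of the same switching system. This is our arithmetic, not
# an Eridu disclosure.

single_hop_ports = 5_120
downlinks = uplinks = single_hop_ports // 2     # 2,560 each way per first-tier system
max_first_tier = single_hop_ports               # one second-tier port per first-tier system
print(downlinks * max_first_tier)               # 13,107,200 endpoints at the limit

# Even with heavy oversubscription or a partially populated second tier, such a
# fabric clears one million compute engines with room to spare.
```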
What you also see in this chart is that even though Eridu says it can reduce the number of switches by a factor of 30X, it is only trying to lower the cost of switching infrastructure by 40 percent even as it reduces network power consumption by 70 percent. In other words, Eridu sees a straight line to how to make some fatty delicious profits.
I look forward to learning more, as I imagine you do as well.