
While it has always been true that flatter and faster networks are possible with every speed bump on the Ethernet roadmap, the scale of networks has kept growing fast enough that the switch ASIC makers and the switch makers have been able to make it up in volume and keep the switch business growing.
With the GenAI explosion, there is a uniform desire among all the big AI players to move away from the proprietary InfiniBand technology controlled by Nvidia, taking all of InfiniBand’s features and grafting them onto a new-and-improved Ethernet that can scale further and do so in a much flatter network to create ever-embiggened AI clusters. A cool 1 million GPU endpoints is the ambitious goal of the Ultra Ethernet Consortium, and it is going to take much more capacious switch ASICs to get there.
Today, Broadcom, the industry leader in the merchant chip market that is getting heavy competition from Cisco Systems and Nvidia on the Ethernet front, is launching its “Tomahawk 6” StrataXGS Ethernet switch ASIC into a market that will snack on the 102.4 Tb/sec ASICs and look ahead to 204.8 Tb/sec and 409.6 Tb/sec Tomahawk 7 and Tomahawk 8 chippery that all of the big AI players have no doubt seen on the Broadcom roadmaps.
They are also probably eagerly awaiting rollouts of co-packaged optics to lower the cost and increase the reach of their network spines for these massive future networks.
This all stands in stark contrast to the enterprise market, where the move from 10 Gb/sec to 100 Gb/sec Ethernet for the back end and front end networks has proceeded at a snail's pace for the past decade and a half. But the desire by many to keep AI inside the corporate walls and the pressure to extract data out of existing systems to drive AI could cause an accelerated – and unprecedented – adoption of faster Ethernet than historical trends might suggest. AI back ends could drive corporate front ends to adopt 100 Gb/sec, 200 Gb/sec, and even 400 Gb/sec Ethernet a lot faster than they might otherwise.
The good news is that there has never been a cheaper way to get 200 Gb/sec or 400 Gb/sec ports than Broadcom’s Tomahawk 6 ASIC thanks to its huge aggregate bandwidth. Cisco Silicon One and Nvidia Spectrum-X will be fast followers, and Marvell Teralynx, Xsight Labs X3 and X4, and eventually Huawei Technologies CloudEngine ASICs will catch up – and pretty much in that order – but this week, it looks like Broadcom is going to be first out the door with a 102.4 Tb/sec device.
Huawei will be particularly challenged because the United States has put export controls on switching ASICs as it has on GPU accelerators from Nvidia and AMD, and given the importance of networking for AI systems, there is no reason to believe that an exception will be carved out for capacious Ethernet ASICs. Since 2020, the HiSilicon chip division of Huawei has been restricted to using indigenous Chinese foundry Semiconductor Manufacturing International Corp, which is stuck at 7 nanometer processes at the moment but is working towards 5 nanometers and below.
There will be some pretty serious first mover advantage here for Broadcom, all driven by economics, which is in turn driven by technology. It is perhaps helpful to put the Tomahawk 6 into perspective relative to its Tomahawk 5 predecessor.
The Tomahawk 5 chip, launched in August 2022, is the last of the monolithic Tomahawk chip designs and, significantly, was created and unveiled to the world before the GenAI boom started in November 2022.
The need for bandwidth, low latency, and high radix for AI training and inference applications – not just training – pushed the Tomahawk 6 design, Peter Del Vecchio, product line manager of the Trident and Tomahawk switch lines at Broadcom, tells The Next Platform. But so did the practicality of the overall Ethernet market, which is evolving at different speeds for different segments.
The Tomahawk 5 chip was the only monolithic chip that delivered 51.2 Tb/sec of aggregate bandwidth; all of the others used chiplets, which wrap multiple signaling SerDes chiplets around a monolithic packet processing engine. Broadcom held out as long as it could to keep the heat and cut-through latency across the networking engine as low as possible, and probably paid the price a little bit on chip yields. The Tomahawk 5 was etched in 5 nanometer processes from Taiwan Semiconductor Manufacturing Co and implemented 512 SerDes, each supplying a lane of traffic running at 100 Gb/sec after the encoding overhead is removed. To be precise, the chip had a native signaling rate of 50 Gb/sec and then used PAM4 modulation on the signal to double-pump two bits of data for each signal, yielding that 100 Gb/sec effective data rate. The switches based on the Tomahawk 5 could implement 64 ports at 800 Gb/sec, 128 ports at 400 Gb/sec, and 256 ports at 200 Gb/sec officially.
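As a sanity check on that arithmetic, here is a minimal sketch in Python of how the PAM4 doubling and lane ganging turn 512 SerDes lanes into the official Tomahawk 5 port counts; the constants are just the figures cited above, nothing from Broadcom's datasheets:

```python
# Back-of-the-envelope lane and port arithmetic for the Tomahawk 5, using only
# the figures cited above. PAM4 packs two bits into each symbol, so a lane with
# a 50 Gb/sec native signaling rate carries an effective 100 Gb/sec of data.

NATIVE_GBPS = 50           # native signaling rate per lane
PAM4_BITS_PER_SYMBOL = 2   # two bits per PAM4 symbol
LANES = 512                # SerDes lanes on the Tomahawk 5

lane_gbps = NATIVE_GBPS * PAM4_BITS_PER_SYMBOL           # 100 Gb/sec per lane
print(f"aggregate: {LANES * lane_gbps / 1000} Tb/sec")   # 51.2 Tb/sec

# Gang lanes into ports to get the official port configurations.
for lanes_per_port in (8, 4, 2):
    ports = LANES // lanes_per_port
    print(f"{ports} ports at {lanes_per_port * lane_gbps} Gb/sec")
# -> 64 ports at 800 Gb/sec, 128 at 400 Gb/sec, 256 at 200 Gb/sec
```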
We thought at the time the Tomahawk 5 was launched that having 512 ports running at 100 Gb/sec – a very high radix indeed, thereby allowing for very flat networks with a fairly small number of spine switches – was an interesting concept. We are not sure if anyone actually implemented this. . . .
Anyway, the Tomahawk 5 delivered a 100 Gb/sec lane of signaling for less than 1 watt, and included cognitive routing to help accelerate AI workloads. Importantly, the SerDes in the Tomahawk 5 was created to drive active copper links out to four meters as well as pluggable optics and the co-packaged optics used in the “Bailly” variants of Tomahawk 5 that have been tested at a number of hyperscalers and cloud builders in the United States and China.
Perhaps more importantly, as has been the case with prior switch ASICs from Broadcom and every other switch chip maker, every time you double the aggregate bandwidth on the device, you can make a switch device with a single chip do the job of six chips running at half the bandwidth to deliver the same number of ports running at the same speed. (You basically create a small leaf/spine network inside of the device to create a non-blocking network inside of the switch box.) This collapsing obviously radically cuts the cost per port, even when a single N generation ASIC costs a lot more than an N-1 generation.
This equation – N ASIC = 4 * (N-1) leaf plus 2 * (N-1) spine – is the magic of doubling aggregate capacity in each successive network ASIC generation in the network racket while at the same time eliminating complexity and cost. This equation is also why hyperscalers and cloud builders want Tomahawk 7 right now when it is not coming for two more years and are salivating over the prospect of a Tomahawk 8 maybe four years from now, which they also wish they had now.
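To make that equation concrete, here is a minimal sketch of the chip-count arithmetic, assuming a non-blocking leaf/spine is built inside the switch box with half of each leaf chip's ports facing the front panel and half facing the spine chips; actual boxes can be wired differently, but the four leaves plus two spines result is the point:

```python
# How many ASICs of a given radix does it take to expose a set of non-blocking
# ports? Assumes a leaf/spine inside the box with half of each leaf's ports
# pointing down (to the front panel) and half pointing up (to the spines).

def chips_needed(ports_needed: int, ports_per_chip: int) -> int:
    if ports_needed <= ports_per_chip:
        return 1                                     # one ASIC does the job
    down_per_leaf = ports_per_chip // 2              # half of each leaf faces out
    leaves = -(-ports_needed // down_per_leaf)       # ceiling division
    spines = -(-(leaves * down_per_leaf) // ports_per_chip)
    return leaves + spines

# 128 ports at 800 Gb/sec: a Tomahawk 6 does 128 such ports per chip, a
# Tomahawk 5 does 64, so the older chip needs 4 leaves + 2 spines = 6 chips.
print(chips_needed(128, 128))   # 1
print(chips_needed(128, 64))    # 6
```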
Broadcom’s presentation says that the Tomahawk 6 chip uses 3 nanometer processes, and that means it is the N3 process from TSMC. We are a bit surprised that the packet processing engine and the SerDes that wrap around it are both etched in 3 nanometer processes. We would have guessed that the central packet processing engine is etched in TSMC N4 (4 nanometer) or N3 (3 nanometer) processes, but that the signaling SerDes were etched using an advanced N5 (5 nanometer) or maybe an N4 process. It is more difficult to shrink I/O chips than it is compute chips, and this is as true for switch ASICs as it is for CPUs that have their I/O broken out separately in chiplet designs. We want to confirm that both SerDes and packet processing chiplets are etched using N3. (And Broadcom just confirmed that all parts are 3 nanometer.)
The Tomahawk 6 comes in two flavors, as you can see on the right of the chart above. One has 512 SerDes – four chiplets with 128 SerDes each – with native 100 Gb/sec signaling and PAM4 modulation to deliver an effective 200 Gb/sec data rate on each lane. If you gang up eight of those lanes to make a port, you get 64 ports running at 1.6 Tb/sec.
Another set of SerDes for the Tomahawk 6 runs at the prior 100 Gb/sec signaling rate per lane – that’s 50 Gb/sec plus PAM4 modulation, like the Tomahawk 5 SerDes – and delivers a whopping 1,024 lanes wrapped around the Tomahawk 6 packet processing engine. At eight lanes per port, you get 128 ports running at 800 Gb/sec, which is twice as many ports as the Tomahawk 5 could drive off of a single ASIC at the same speed. If you wanted to drive 128 ports running at 800 Gb/sec using Tomahawk 5, you would need six chips in a baby leaf/spine setup inside of the switch box, adding extra hops inside of the switch instead of the single hop across a lone Tomahawk 6 ASIC.
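The same lane arithmetic covers the two Tomahawk 6 SerDes flavors; this is a quick sketch using only the lane counts and lane speeds given above, with eight lanes ganged into each port:

```python
# The two Tomahawk 6 SerDes flavors, both adding up to 102.4 Tb/sec:
# 512 lanes at 200 Gb/sec (100G native plus PAM4) or 1,024 lanes at 100 Gb/sec.

FLAVORS = {
    "512 x 200G lanes":   (512, 200),
    "1,024 x 100G lanes": (1024, 100),
}

LANES_PER_PORT = 8
for name, (lanes, lane_gbps) in FLAVORS.items():
    ports = lanes // LANES_PER_PORT
    port_gbps = LANES_PER_PORT * lane_gbps
    agg_tbps = lanes * lane_gbps / 1000
    print(f"{name}: {ports} ports at {port_gbps} Gb/sec ({agg_tbps} Tb/sec total)")
# -> 64 ports at 1,600 Gb/sec and 128 ports at 800 Gb/sec, both 102.4 Tb/sec
```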
“From everyone we know – OEMs, ODMs, hyperscalers, and cloud builders – we are getting so much pressure to get Tomahawk 6 out to market,” says Del Vecchio. “They are all telling us they absolutely have to be the first to market with GPU clusters based on Tomahawk 6. So there is a tremendous amount of engineering effort going on right now. We will see your standard pizza box form factors, where people moved away from chassis to pizza boxes cabled up with DAC cables and optics a while ago. But what we are also seeing now with Tomahawk 6 is people trying to get these AI clusters as efficient and as dense as possible, and we are seeing it used both for the scale out networks as well as for the scale up networks.”
We are going to drill down into the scale up networks based on Broadcom Ethernet separately, but suffice it to say that using 200 Gb/sec links, Broadcom says that it can link up 512 XPUs into a single shared memory image using Tomahawk 6.
The scale out story looks similar to the scaling story inside of a switch for a given number of ports, as you might imagine:
This chart above talks about 128,000 XPUs, but it is really 131,072 XPUs in the scale out cluster when you multiply it out. Here is what the two-tier Tomahawk 6 network to link that many XPUs together would look like and how you would need a three-tier network for any 51.2 Tb/sec Ethernet ASIC – including Tomahawk 5 – to interconnect the same 131,072 GPUs with 200 Gb/sec ports between all devices:
Del Vecchio says that this is an example with one 200 Gb/sec link per endpoint, and that for higher bandwidth, clusters would normally increase the number of planes. So, for instance, if you want 800 Gb/sec of total bandwidth to an endpoint, you would multiply the number of switches at each layer by four; if you want to reach 1.6 Tb/sec, which is akin to what Nvidia is doing with NVLink 5 ports, you would have to multiply by eight.
As you can see, the number of switches is a lot higher in the three-tier network than in the two-tier network, at a factor of 3.3X, and this is pure cost. Perhaps even more importantly, the number of optical transceivers used in the super spine and spine layers is 1.7X higher with the older ASICs, which have only 51.2 Tb/sec of switching capacity, and Del Vecchio says that about 70 percent of the power of the overall network is burned by those optics. Power is money, and the larger the number of optics, the higher the chances of having a failure that will stop the AI processing. The upshot is that a two-tier network using an N generation of ASIC burns about half the power of a three-tier network based on an N-1 generation of ASIC.
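For what it is worth, the switch counts behind that 3.3X gap can be roughed out with a hedged Clos-sizing sketch, assuming a non-blocking fabric with half of every switch's ports pointing down, one 200 Gb/sec link per XPU, and a radix of 512 ports at 200 Gb/sec for the 102.4 Tb/sec Tomahawk 6 versus 256 ports for a 51.2 Tb/sec ASIC; real rail-optimized or oversubscribed designs will differ, but the arithmetic lands in the same place:

```python
# Hedged Clos sizing: non-blocking, half of each switch's ports face down,
# one 200 Gb/sec link per XPU. Radix = ports per switch at 200 Gb/sec.

def two_tier_switches(xpus: int, radix: int) -> int:
    leaves = -(-xpus // (radix // 2))                 # each leaf hosts radix/2 XPUs
    spines = -(-(leaves * (radix // 2)) // radix)     # spines absorb the uplinks
    return leaves + spines

def three_tier_switches(xpus: int, radix: int) -> int:
    pod_xpus = (radix // 2) ** 2                      # XPUs per pod in a fat tree
    pods = -(-xpus // pod_xpus)
    leaves = aggs = pods * (radix // 2)
    cores = -(-(aggs * (radix // 2)) // radix)
    return leaves + aggs + cores

XPUS = 131_072                                        # 512 ** 2 / 2, the two-tier maximum
print(two_tier_switches(XPUS, 512))     # Tomahawk 6: 512 leaves + 256 spines = 768
print(three_tier_switches(XPUS, 256))   # 51.2 Tb/sec ASIC: 1,024 + 1,024 + 512 = 2,560
```

That works out to 2,560 switches against 768, which is where the 3.3X factor comes from before you even count the extra layer of optics.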
You can see why the hyperscalers and cloud builders want to roll Tomahawk 6 out as soon as possible for their state of the art AI clusters, which are scaling to 100,000 GPUs and beyond. Del Vecchio says that the OEMs will probably have their products ready in the first quarter of 2026 and deployments in the second quarter of 2026, but everybody is trying to do it faster if they can.