Despite all of the buzz that optics get in datacenter networking, copper is still king of the short haul. The reason is simple: The optical transceivers that link servers to switches, and switches to each other, across rows and over longer datacenter distances are crazy stupid expensive. And they also fail in the field often enough to screw up massive HPC simulation and AI training runs.
And so the trick to technical and economic success is to use optics only when you absolutely have to, and otherwise to stick with copper wiring and devices that can drive it as directly as possible given the systems in use.
Nothing illustrates this principle better than the DGX GB200 NVL72 supercomputer node launched by Nvidia in March, which uses its “Grace” CG100 Arm server CPU and a pair of “Blackwell” GB100 GPU accelerators and which is lashed together with 5,184 massive copper cables that interlink 72 of the Blackwell GPUs in an all-to-all configuration. The 200 Gb/sec SerDes in the nine NVLink Switch 4 switches at the heart of the NVL72 system can drive the 1.8 TB/sec NVLink 5 ports directly, over copper wires, without the need for retimers and certainly without the need for the optical transceivers used in long-haul datacenter networks.
At the launch, Nvidia co-founder and chief executive officer Jensen Huang said that the NVL72 system weighed in at 120 kilowatts of power, but that if Nvidia had used optics, the retimers and optical transceivers would have added another 20 kilowatts to its power budget. He did not say what it would cost, but we think optics would have significantly increased the internetworking cost for the rackscale system – maybe doubling it – and also increased the chances of failures at the node level.
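To put that power penalty in proportion, here is a minimal Python sketch of the arithmetic. It assumes, as the figures above suggest, that the 120 kilowatts describes the copper-based design and that the 20 kilowatts for retimers and transceivers would simply be added on top. The function name is ours and purely illustrative.

```python
def optics_power_overhead(base_kw, optics_kw):
    """Return (total power in kW, fractional increase) if optics were added.

    Assumption: optics power is purely additive to the copper-based design.
    """
    total_kw = base_kw + optics_kw
    increase = optics_kw / base_kw
    return total_kw, increase


# Figures quoted at the launch: 120 kW as built, 20 kW more with optics.
total_kw, increase = optics_power_overhead(120.0, 20.0)
print(f"Total with optics: {total_kw:.0f} kW")
print(f"Increase over copper design: {increase:.1%}")
```

Under those assumptions, optics would have pushed the rack to 140 kilowatts, roughly a one-sixth increase, before counting any cost or reliability effects.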
Broadcom, which is the volume leader in peddling merchant silicon into the datacenter switching market and which has a sizable network interface card business based on its “Thor” family of NIC chips, pays very close attention to both copper and optics and their cabling, too, and for the same cost and reliability reasons that compelled Nvidia to make the choices it did for the NVL72 compute node mentioned above.
“A 4,000 node cluster will have 9,200 optics across its interconnect,” Hasan Siraj, head of software products and ecosystem at Broadcom, tells The Next Platform, comparing and contrasting InfiniBand with optics and Ethernet without them. “Every hyperscaler will tell you, and indeed every customer will tell you, that they fail, and the failure rate is up to 5 percent. But even if you have a 2 percent failure rate in such a cluster, you are looking at about 15 optics failures per month. And while InfiniBand might be a lossless fabric, it inherently becomes lossy because these optics are failing. And InfiniBand has another problem in these failures. Compared to Ethernet, it will take 30 times longer for it to recover from these failures just because it is a static fabric. InfiniBand has a Unified Fabric Manager, but you have to go back and find out what the next route is. Whereas with Ethernet, it is inherently a dynamic fabric – you have Border Gateway Protocol (BGP) and Bidirectional Forwarding Detection (BFD) and other capabilities, and we have done things in the silicon to recover from these hardware failures within 10 nanoseconds. All of these things help improve the AI job completion time.”
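The failure math in that quote can be checked with a few lines of Python. This is a sketch under one assumption that Siraj does not state outright: that the 2 percent and 5 percent figures are annualized failure rates. That reading is the one under which 9,200 optics work out to about 15 failures per month. The function name is ours.

```python
def expected_failures_per_month(optics_count, annual_failure_rate):
    """Expected optics failures per month, assuming an annualized failure
    rate spread evenly across the twelve months of the year."""
    return optics_count * annual_failure_rate / 12


OPTICS_IN_CLUSTER = 9_200  # optics in the 4,000 node cluster from the quote

for rate in (0.02, 0.05):
    monthly = expected_failures_per_month(OPTICS_IN_CLUSTER, rate)
    print(f"{rate:.0%} annual failure rate: about {monthly:.0f} failures per month")
```

At the 5 percent upper bound, the same assumption yields roughly 38 failures per month, which is more than one per day.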
The point is, you want to avoid going back to a checkpoint, reloading the system state from that point, and then re-running the HPC simulation or AI training run from there. And you want to avoid as many optics in the network design as possible to eliminate failures, and that means driving Ethernet ports directly out of the NIC and using direct attach copper (DAC) cables wherever possible.
This is what Broadcom’s “Thor” line of NIC chips, which has just been updated, is all about.
The “Thor 1” ASIC was launched in early 2019 and started sampling that fall. Thor 1, which was etched using 16 nanometer processes from Taiwan Semiconductor Manufacturing Co, was used in network adapters that plugged into PCI-Express 4.0 server slots. Its SerDes drove native 56 Gb/sec signaling with PAM-4 encoding and, after encoding overhead was taken off, could drive a pair of 100 Gb/sec ports.
The “Thor 2” NIC chip, which like the NVLink Switch 4 ASIC can drive copper cabling directly off the SerDes on the ASIC, was launched in 2022, sampled in 2023, and is now shipping. Because low power is such an important factor in AI networks given the scale of these machines – from 20,000 to 60,000 GPUs in the largest clusters we have heard about – Broadcom has really pushed the power envelope on the Thor 2 NIC chip by shrinking down to 5 nanometer TSMC processes. (This is small enough to yield a big power savings, but it avoids the 4 nanometer and 3 nanometer processes, which are not yet mature or cheap.)
Here is a comparison chart of the Thor ASICs from 2022, when the Thor 1 chips had been in production for two years and Broadcom was hinting a bit about the Thor 2 lineup:
As you can see, the Thor 2 chip SerDes has native 112 Gb/sec signaling with PAM-4 encoding on top and, after that encoding overhead is taken off, is able to drive a single 400 Gb/sec port, a pair of 200 Gb/sec ports, or a quad of 100 Gb/sec ports. All of the Thor chips thus far have multihost capabilities, which allow the bandwidth on uplinks to be split across two or four hosts. The Thor chips are also guaranteed long technical and economic lives, as you see, with Thor 1 available at least through 2036 – yes, that is a dozen years from now – and Thor 2 well beyond that, but as yet undetermined.
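The port options described above fall out of simple lane arithmetic, which the following Python sketch illustrates. Two things here are our assumptions rather than anything Broadcom has stated in this article: that the NIC exposes four SerDes lanes, and that each 112 Gb/sec lane delivers 100 Gb/sec of usable bandwidth after encoding overhead. The function name is hypothetical.

```python
from typing import List, Tuple


def port_breakouts(lanes: int, effective_gbps_per_lane: int) -> List[Tuple[int, int]]:
    """List (port_count, port_speed_gbps) options that split the lanes evenly.

    Only power-of-two port counts are considered, matching the single,
    pair, and quad configurations described in the article.
    """
    options = []
    ports = 1
    while ports <= lanes:
        if lanes % ports == 0:
            lanes_per_port = lanes // ports
            options.append((ports, lanes_per_port * effective_gbps_per_lane))
        ports *= 2
    return options


# Assumed: 4 lanes at 100 Gb/sec usable each (112 Gb/sec raw signaling).
for count, speed in port_breakouts(lanes=4, effective_gbps_per_lane=100):
    print(f"{count} x {speed} Gb/sec")
```

Under those assumptions the sketch yields the 1 x 400, 2 x 200, and 4 x 100 Gb/sec configurations listed for Thor 2.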
The Thor 2 adapters plug into PCI-Express 5.0 slots. Broadcom is happy to sell hyperscalers, cloud builders, HPC centers, and anyone else whole adapter cards, or just ASICs, or even IP from the designs for those who want to create their own NICs.
Here is what the single port Thor 2 BCM957608-N1400G adapter looks like, which supports one 400 Gb/sec port:
And there is the Thor 2 BCM957608-P2200G adapter, which drives two 200 Gb/sec ports:
Siraj says that with Thor 2, Broadcom will be the only NIC provider that supports linear pluggable optics, which is just a fancy way of saying that the NIC can drive the optics directly if you need optics to span longer distances in the datacenter. But the Thor 2 can also directly drive copper cables that are up to 5 meters in length, with most of the NIC competition, according to Siraj, only being able to drive copper cables that are 2.5 meters in length. That cuts down power requirements a lot – to as little as half of what the competition needs, says Siraj.
The Thor 2 chip also supports RoCE v2 RDMA, which is the analog to the RDMA built into InfiniBand but running atop Ethernet. Siraj says that with the Thor 3 ASIC, Broadcom will be adopting Ultra Ethernet Consortium technologies and driving the port speed up to 800 Gb/sec. Thor 3 is expected to be launched next year, more or less in synch with the expected delivery of PCI-Express 6.0 peripheral slots in servers.
With Thor3 Broadcom will hit the competition with heavy HAImmer. I am sure HAIllywood is watching.
Based on Thor 2’s schedule, with Thor 3 launched in 2025, does it ship in 2027?
Probably more like late 2025 to early 2026.