Nvidia Weaves Silicon Photonics Into InfiniBand And Ethernet

When it comes to networking, the rules around here at The Next Platform are simple. For hyperscale networking for massively distributed, largely non-coherent applications, the rule is: Route when you can, and switch if you must. For HPC and AI workloads, which are both latency and bandwidth sensitive, we stick with the older adage: Switch when you can, route if you must. And when it comes to network cabling, we go with: Copper when you can, fiber when you must.

Nothing illustrates this last principle better, perhaps, than the vast copper cabling making up the backplane of Nvidia’s rackscale GB200 NVL72 system, which is composed of 36 MGX server nodes, each pairing two of its “Blackwell” B200 GPU accelerators with a single “Grace” CG100 Arm server processor to create a shared memory compute engine cluster with 36 CPUs and 72 GPUs all pulling the AI rope like a team of Clydesdales. The system uses NVSwitch 4 interconnects to create that CPU and GPU memory fabric, and it requires over 5,000 big, fat copper cables driven directly by the NVLink 5 SerDes running at 224 Gb/sec. Because all of this communication is inside of a rack, copper cables are sufficient, if cumbersome, to provide cooler, high bandwidth pipes between the GPUs, which have the CPUs hanging off of them.

As we have said in the past, we think Nvidia will eventually have co-packaged optics on its GPUs, and maybe even between the GPU compute chips and their HBM memory, to offer configuration flexibility and lower component density without sacrificing latency and performance. After all, it is heat density, driven by the need to keep components in close proximity, that is forcing Nvidia into liquid cooling.

The math is easy. Every time you double the bandwidth on a copper wire, you double the garbage on the line, so you can only get a clean signal across half as much wire length. When – not if – Nvidia doubles the bandwidth with the NVLink 6 ports on its next generation “Rubin” GPU accelerators, copper will only be able to span a half rack of GPUs, and if those GPUs also run hotter, it might be significantly less than a half rack. This is clearly not a goal, and it is the best argument for switching to CPO for the NVLink 6 ports on the GPUs, and maybe even on the future “Vera” CPUs, too. If anything, Nvidia will want to increase the NUMA domain for GPUs by a factor of 2X or 4X as AI inference workloads grow, not cut it in half.
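Here is that scaling rule as a minimal sketch; the baseline reach figure is our own assumption for illustration, not an Nvidia specification:

```python
# A minimal sketch of the rule of thumb above: every doubling of the copper
# signaling rate roughly halves the usable cable reach.
def copper_reach_m(baseline_gbps: float, baseline_reach_m: float, target_gbps: float) -> float:
    """Estimate clean copper reach when scaling the per-lane signaling rate."""
    speed_ratio = target_gbps / baseline_gbps
    return baseline_reach_m / speed_ratio  # reach halves for every doubling of speed

# Assume roughly 1.5 meters of clean reach at 224 Gb/sec, enough to span a rack.
print(copper_reach_m(224, 1.5, 448))  # ~0.75 meters at 448 Gb/sec, half a rack
```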

But today, as Nvidia’s GPU Technology Conference 2025 gets into full swing with the keynote by company co-founder and chief executive officer Jensen Huang, is not the day for CPO on the GPUs or on the banks of HBM3E or HBM4 memory that attach to them. Which we crave on your behalf.

It is, however, the day when Nvidia will reveal its plans to adopt silicon photonics and to deploy co-packaged optics (CPO) with its Quantum InfiniBand and Spectrum Ethernet families of switches, and that is not only an exciting development, but one that, as it turns out, will go a long way towards cutting the power requirements for networks in datacenter-scale AI systems.

The power consumed by optics in the network is enormous, and so is the capital expense. Anecdotally, we have heard it said many times that the majority of the cost in a datacenter-scale cluster is in the optical transceivers at both ends of a link and the fiber optic cable between them. The pieces that link switches to network interface cards are said to be 75 percent to 80 percent of the cost of a network, with the switches and the NICs making up the other 20 percent to 25 percent. Which sounds crazy when you say it out loud.

Gilad Shainer, senior vice president of marketing for networking at Nvidia, prebriefed The Next Platform about the silicon photonics and co-packaged optics for switches, and so did Ian Buck, vice president and general manager of hyperscale and HPC at the company. Two charts from them give us some insight into the problem that datacenter operators are facing when they use optical links to cross-connect the servers and storage in their datacenters. (Not that they have had a choice until now. . . . )

Here’s the one from Shainer:

And here’s the one from Buck:

Let’s synthesize this. Buck’s chart is based on a datacenter using server nodes with two CPUs for every four GPUs (like the GB200 NVL72 rackscale MGX system design), and it has 100,000 servers in the datacenter and therefore 400,000 GPUs. (If you used an HGX design, which would not fully connect the memories of the GPUs within a rack but only within a server node, you would have one CPU for every four GPUs and would only need 50,000 servers to house 400,000 GPUs, with slightly fewer optical transceivers, but it would take up twice as much space.)

In any event, the scenario that Nvidia chose would have 2.4 million optical transceivers, the pluggable modules that go into every server port and every switch port to convert electrical signals to optical ones that can be pushed over a fiber optic pipe. These 2.4 million transceivers use 40 megawatts of power, and the lasers on these pluggable modules account for 24 megawatts of this.

In a “traditional” hyperscaler and cloud datacenter, which has a Clos topology instead of the full fat tree topology of an AI or HPC supercomputer, you burn about 2.3 megawatts on transceivers, which if you work the numbers backwards is just shy of 140,000 of these pluggable modules. The reason the transceiver count is so low is simple: A server with one or two CPUs doing web infrastructure or even search engine crawling has one port, while a GPU server needs no fewer than one port per GPU. The sheer number of compute engines in an AI supercomputer is what drives the use of optical transceivers.
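For those who want to check the arithmetic, here is a minimal sketch of that backwards math, using only figures from the charts; the per-module wattage and the transceivers-per-GPU ratio are implied by those figures rather than stated directly by Nvidia:

```python
# Working the transceiver numbers backwards from the figures in the charts.
transceivers_ai = 2_400_000   # pluggable modules in the 400,000 GPU AI datacenter
power_ai_mw = 40.0            # megawatts burned by those transceivers

watts_per_module = power_ai_mw * 1e6 / transceivers_ai
print(watts_per_module)                        # ~16.7 watts per pluggable module

gpus = 400_000
print(transceivers_ai / gpus)                  # six transceivers per GPU, on average

# The "traditional" Clos datacenter burns 2.3 megawatts on transceivers:
power_clos_mw = 2.3
print(power_clos_mw * 1e6 / watts_per_module)  # ~138,000 modules, just shy of 140,000
```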

And it gives the industry a perfect excuse to get rid of them, which Nvidia is doing with its next generation of Quantum-X InfiniBand and Spectrum-X Ethernet switches, and presumably will eventually do with its ConnectX SmartNICs and BlueField DPUs and, as we point out above, the NVLink ports on GPUs and CPUs and the NVSwitch memory atomic switches.

As you can see in the chart above, Nvidia has two different approaches to co-packaged optics, which it developed in conjunction with the slew of partners listed in the fine print at the bottom of the chart. The silicon photonics engines were created by Nvidia itself (Mellanox has plenty of expertise in making pluggable optics), and a new design for micro-ring modulators (MRMs) was created so these switch ASICs could have their optics integrated.

The move to 200 Gb/sec signaling lanes in 800 Gb/sec ports may have been where push came to shove. Just getting the signals from the switch ASIC to the ports on the panel was going to take a large number of signal retimers (perhaps as many as two per port), and as the financials for Astera Labs show, the costs “sure do mount up,” as actor Charlie Sheen once put it.

Nvidia has also worked with fab partner Taiwan Semiconductor Manufacturing Co to optimize its own photonics engine design, which includes high power (and high efficiency) lasers and detachable fiber connectors.

Let’s zoom in on this chart above for a second.

As you can see on the left, the future Quantum-X InfiniBand ASIC with CPO has a monolithic switch ASIC chip with six distinct CPO modules, each with three connectors, for a total of what looks like 18 ports running at 800 Gb/sec but is really 36 ports (each plug seems to carry two ports).

Clearly, this smaller InfiniBand CPO module is designed to be low cost and manufacturable at high yield. This is a first step, and it is not going to result in a switch with high radix, so it will take lots of them to interlink a given number of GPU ports through the NICs on the servers.

The Spectrum-X with CPO has a multichip design: the Ethernet switch ASIC has a monolithic packet processing engine wrapped by eight SerDes chiplets (two per side), plus four unknown chiplets at the corners doing we know not what. Each side of the Spectrum-X CPO chip has nine ports, for a total of 36 ports running at 800 Gb/sec.

In both designs, the SerDes are running at 224 Gb/sec per lane, with four lanes making up a port and 96 Gb/sec in total lost to encoding overhead, which gives you a net 800 Gb/sec per port. The SerDes on the Quantum-X ASIC have a total of 72 lanes, and the SerDes on the Spectrum-X chiplet collection have 144 lanes.
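Here is that lane math as a quick sketch, using only the lane rates and overhead cited above:

```python
# Per-port lane math for the 800 Gb/sec CPO ports described above.
lane_rate_gbps = 224          # raw signaling rate per SerDes lane
lanes_per_port = 4
encoding_overhead_gbps = 96   # total encoding overhead across the four lanes

raw_per_port = lane_rate_gbps * lanes_per_port        # 896 Gb/sec raw
net_per_port = raw_per_port - encoding_overhead_gbps  # 800 Gb/sec usable
print(raw_per_port, net_per_port)

# The Spectrum-X chiplet collection's 144 lanes line up with its 36 ports:
print(144 / lanes_per_port)   # 36 ports at 800 Gb/sec each
```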

With all of the hyperscalers and cloud builders wanting to use Ethernet for the back-end networks for their AI clusters and most other neoclouds and many HPC supercomputing centers expected to follow suit, Shainer focused on the benefits that accrue to Spectrum-X.

First, let’s see how Nvidia is doing its CPO packaging and what the effect is. Then we will take a look at the switches themselves.

Here is how it is done, schematically, and what the power savings are from using integrated optics and an integrated laser source in the switch. Interestingly, the math was shown for 1.6 Tb/sec ports, which are in the datacenter’s future, not its present:

As you can see in the chart above, the digital signal processor on the optical transceiver burns 20 watts and the externally modulated laser that provides the light source for the transceiver burns 10 watts. So that is 30 watts for each of the 2.4 million transceivers cross-connecting 100,000 servers and 400,000 GPUs. When we do that math, we get 72 megawatts, not 40 megawatts (which may be the 800 Gb/sec port number).

With CPO, you have a continuous wave laser source in the switch box that burns 2 watts per port, and the optical engine, which is integrated onto the same substrate that the Spectrum switch ASIC uses, burns 7 watts. So now you are down to 9 watts per port, which across 2.4 million links brings it down to 21.6 megawatts. By our math, that is a 3.3X reduction in power for the links.
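Here is that power comparison as a minimal sketch, using the per-port wattages from the chart:

```python
# Pluggable optics versus co-packaged optics, at 1.6 Tb/sec per port.
links = 2_400_000

# Pluggable transceiver: 20 W digital signal processor + 10 W externally modulated laser.
pluggable_watts = 20 + 10
print(links * pluggable_watts / 1e6)   # 72 megawatts

# Co-packaged optics: 2 W continuous wave laser + 7 W optical engine per port.
cpo_watts = 2 + 7
print(links * cpo_watts / 1e6)         # 21.6 megawatts

print(round(pluggable_watts / cpo_watts, 2))  # ~3.33X reduction in link power
```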

Not only does the power go down by using CPO, but because there are fewer transitions between components for the signals, there is less noise in the overall end to end configuration. Take a look:

Every time you jump from component to component, you create signal loss, and with the pluggable optics attached to the switch, you have five transitions across the transceiver and switch printed circuit boards, substrates, and port cage, which adds up to 22 decibels of signal loss in total. With CPO, you have one transition in the substrate to link the switch ASIC to the silicon photonics module, which is a signal loss of only 4 decibels. That is 18 decibels less loss, which works out to a factor of roughly 63X in linear terms.
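Because decibels are logarithmic, the two loss figures have to be subtracted and then converted to get a linear factor, as this quick check shows:

```python
# Converting the insertion loss figures above from decibels to a linear factor.
loss_pluggable_db = 22.0   # five component transitions with pluggable optics
loss_cpo_db = 4.0          # one substrate transition with co-packaged optics

improvement_db = loss_pluggable_db - loss_cpo_db   # 18 dB less loss
improvement_linear = 10 ** (improvement_db / 10)   # ~63X in linear terms
print(improvement_db, round(improvement_linear, 1))
```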

Here is the effect of all of this:

The chart above says you can have 3X the number of GPUs in the same optics power envelope, but as we showed above (and as you can discern from the chart with your own eyes), it is really 3.3X. Significantly, the number of lasers needed to connect any given number of GPUs together will also drop by more than 4X. The trick, of course, is to have laser sources inside the Quantum-X and Spectrum-X switches that are easily replaceable in the field in case they fail, or that are reliable enough not to have to worry about that. Hence, the Quantum-X and Spectrum-X switches with CPO will be liquid cooled, which allows them to run cooler and keeps their lasers from getting all wacky.

At the moment, Nvidia has three different switches planned as it rolls out co-packaged optics.

The first is the Quantum 3450-LD, which will have four of these Quantum-X CPO sockets inside the box, fully connected in a non-blocking manner, to deliver 144 ports running at 800 Gb/sec for 115.2 Tb/sec of aggregate effective bandwidth across those ports. (We wonder why it is not six ASICs, which is what you would need to present four ASICs to the panel, with the other two used to network those four together.) This Quantum-X switch will be available in the second half of 2025.

Here is the Quantum-X switch with its full-on cable moustache:

The two Spectrum-X switches using CPO are going to take longer to get into the field, and are not expected until the second half of 2026.

The first Ethernet switch from Nvidia with CPO is the Spectrum SN6810, which will have a single Spectrum-X CPO device and deliver 102.4 Tb/sec of aggregate bandwidth for 128 ports running at 800 Gb/sec. (There are clearly some extra CPO units on the package to increase the packaging yield.)

The Spectrum SN6800 switch is a big, bad box, with 512 ports running at 800 Gb/sec and a total of 409.6 Tb/sec of effective aggregate bandwidth across the four ASICs inside the box. We again wonder why there are not six Spectrum-X CPO ASICs in the box instead of four, to cross-connect them in a non-blocking fashion, but Shainer said there were four. We will try to get an explanation of how this works.

Perhaps the switches with four ASIC sockets are not cross-connected at all, but are more like sleds in a multi-node server?
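In any event, the port counts and aggregate bandwidths of the three CPO switches line up neatly, as this quick sketch shows:

```python
# Aggregate effective bandwidth for the three co-packaged optics switches.
def aggregate_tbps(ports: int, port_gbps: int = 800) -> float:
    """Total effective bandwidth across all ports, in Tb/sec."""
    return ports * port_gbps / 1000

print(aggregate_tbps(144))  # Quantum 3450-LD: 115.2 Tb/sec
print(aggregate_tbps(128))  # Spectrum SN6810: 102.4 Tb/sec
print(aggregate_tbps(512))  # Spectrum SN6800: 409.6 Tb/sec
```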


