Moore’s Law may be slowing down the pace of CPU compute capacity increases in recent years, but innovation has been coming at a steady drumbeat for the interconnects used inside servers and between nodes in distributed computing systems.
We got to thinking about this as news broke of another dot release in the evolving PCI-Express 6.0 specification, which is being put together by the PCI Special Interest Group (PCI-SIG), the body that has controlled the PCI, PCI-X, and PCI-Express peripheral bus standards since Intel offered its Peripheral Component Interconnect bus as a standard when it was announced back in 1992. IBM, Hewlett Packard, and Compaq created their own bus standard back in 1998, called PCI Extended, or PCI-X, which like PCI was a half-duplex parallel bus, meaning all devices shared the bandwidth on that bus and only one side of the link between two devices could talk at a time. With PCI-Express, which debuted in 2003, the bus was moved to a point-to-point interconnect between multiple devices using duplex serial connections with varying numbers of lanes between the devices, meaning both endpoints in a link could talk at the same time, each using the full bandwidth of the pipes between them.
While PCI-Express has been successful, we were stuck at PCI-Express 2.0, with its relatively modest lane speeds and aggregated bandwidths across one, two, four, eight, or sixteen lanes, for more than three years, and PCI-Express 3.0 stuck around for another seven years after that. As we noted back in June 2015, when multicore processors were shifting into high gear with ever-increasing numbers of cores and when Ethernet networking was starting to get back on the Moore’s Law track of a two-year – and sometimes a one-year – cadence of bandwidth improvements, the PCI-Express bus was stuck in the mud and not really delivering enough bandwidth for the kinds of heterogeneous computing that were evolving. The PCI-Express bus had become the new bottleneck.
The bus situation inside of systems is getting a lot better here in 2019. The PCI-Express 4.0 bus, which was introduced in 2017, came out first in IBM’s Power9-based Power Systems machines later that year and is now gradually making its way into X86 and Arm processors. With this update, a peripheral hanging off the PCI-Express controller on a CPU package can deliver 31.5 GB/sec of bandwidth (after error correction encoding overhead is taken out) using sixteen lanes (x16) running at a raw 16 Gb/sec each. That x16 slot is typically used to drive a high-end GPU or FPGA accelerator and is the high point for bandwidth with a PCI-Express peripheral. Anyway, that PCI-Express 4.0 x16 slot has nearly eight times the bandwidth per slot that the original PCI-Express 1.0 specification offered a decade and a half ago. (The encoding schemes were far less efficient with PCI-Express 1.0 and 2.0 than they are with the follow-ons.) The PCI-Express 5.0 specification, which doubled up the bandwidth again using 32 Gb/sec lanes, was finalized in May of this year, and after encoding overhead will deliver a smidgen over 63 GB/sec of bandwidth each way over an x16 duplex link when devices presumably become available sometime next year.
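The arithmetic behind those figures is simple enough to sketch. Here is a back-of-the-envelope calculation using the raw lane rates and the encoding schemes mentioned above – 8b/10b for the first two generations, 128b/130b after that; the helper name and table layout are our own:

```python
# Back-of-the-envelope PCI-Express x16 bandwidth per generation.
# Each entry: (raw lane rate in Gb/sec, encoding payload bits, encoding line bits)
GENERATIONS = {
    "PCIe 1.0": (2.5, 8, 10),     # 8b/10b encoding: 20 percent overhead
    "PCIe 2.0": (5.0, 8, 10),
    "PCIe 3.0": (8.0, 128, 130),  # 128b/130b: roughly 1.5 percent overhead
    "PCIe 4.0": (16.0, 128, 130),
    "PCIe 5.0": (32.0, 128, 130),
}

def x16_bandwidth_gbytes(lane_gbits, payload, line):
    """Effective one-way bandwidth of an x16 link in GB/sec after encoding."""
    effective_gbits = lane_gbits * 16 * payload / line
    return effective_gbits / 8  # convert bits to bytes

for gen, (rate, payload, line) in GENERATIONS.items():
    # The PCIe 4.0 line works out to 31.5 GB/sec, the PCIe 5.0 line to 63.0 GB/sec
    print(f"{gen}: {x16_bandwidth_gbytes(rate, payload, line):.1f} GB/sec each way")
```

Running that confirms the ratio in the text: 31.5 GB/sec for PCI-Express 4.0 against 4 GB/sec for PCI-Express 1.0 is just shy of a factor of eight, with the heavy 8b/10b encoding tax accounting for the shortfall from a clean factor of eight in raw lane speed.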
The ink is not really dry on the PCI-Express 5.0 specification, and already the PCI-SIG is going to double up the raw lane speed again, to 64 Gb/sec, which will allow an x16 link to drive a tad over 128 GB/sec of bandwidth each way over the duplex serial links. That is a lot of bandwidth increase in a relatively short amount of time, and we have physicists and chemists to thank for it as they come up with new materials used in the PHY communication circuits in interconnects (not just PCI-Express, but also Ethernet and others).
Like the latest generation of Ethernet switch ASICs and the standards they are based on, the PCI-Express 6.0 standard will be moving to pulse amplitude modulation encoding – PAM4 encoding, to be precise – to cram twice as many bits into a signal as can be done with the conventional bit encoding that has been used to date. The forthcoming PCI-Express 6.0 specification will also have a low-latency forward error correction (FEC) layer added to the protocol, which most interconnects have to add as bandwidths increase and the odds of dropping bits increase. FEC adds a little latency to all interconnect protocols, and it is not clear how this will affect PCI-Express. The PCI-SIG rolled out the initial PCI-Express 6.0 specification in June, and with the 0.3 release of the spec hitting here in October, it is confident that it can finish the full specification and have it ready to release to hardware manufacturers sometime in 2021, as planned. That probably means PCI-Express 6.0 devices could come to market anywhere from late 2021 through the first half of 2022.
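To make the PAM4 idea concrete, here is a toy sketch – our own illustration, not anything from the spec itself. Conventional NRZ signaling carries one bit per symbol using two voltage levels; PAM4 maps each pair of bits to one of four amplitude levels, so the same symbol rate carries twice the bits. The Gray-coded mapping below is the typical one, chosen so that a misread of one amplitude level costs only a single bit error:

```python
# Toy PAM4 encoder: each two-bit pair becomes one of four amplitude levels.
# Gray-coded so adjacent levels differ by exactly one bit.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}

def pam4_symbols(bits):
    """Encode an even-length bit sequence into PAM4 amplitude levels (0-3)."""
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

bits = [1, 0, 0, 1, 1, 1, 0, 0]
print(pam4_symbols(bits))  # 4 symbols carry 8 bits; NRZ would need 8 symbols
```

The flip side of squeezing four levels into the same voltage swing is a smaller gap between levels and thus a higher raw bit error rate, which is exactly why the PCI-Express 6.0 specification pairs PAM4 with that forward error correction layer.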
With so many multichip and heterogeneous architectures dependent on the PCI-Express transport, and with protocols such as CXL, CAPI, CCIX, and Gen-Z layered over top of the PCI-Express hardware, having steady bandwidth increases like we are seeing is vital to system architecture. (We think there is a good chance that NVLink and maybe even OpenCAPI could be converged with some of these protocols, such as CXL, to create a single protocol for linking compute and accelerators to each other atop PCI-Express.)
Here is an interesting chart that the PCI-SIG put together showing the historical bandwidth trends for PCI, PCI-X, and PCI-Express buses over time since 1992. Unlike the numbers we discussed above and unlike the numbers we will plot in our own chart below in a moment, the PCI-SIG chart adds up the bandwidth going both ways at the same time in the full duplex link (an x16 link in the case of PCI-Express), and the chart also does not take the encoding overhead out of the bandwidth (which we did).
The lighter line shows what the bandwidth progression over time on the peripheral bus would have been if it doubled every three years like clockwork. Obviously, the shift to the serial, duplex lane architecture with PCI-Express offered a dramatic improvement in bandwidth over what might have been expected, and importantly, by adding point-to-point links, the bus was not shared as it was with the PCI and PCI-X buses. With those older buses, when you added a second card to the bus, it ate some of the bandwidth, as you might expect, and the contention between the devices to take over the bus to transmit data caused additional overhead; adding many cards to a system compounded that overhead. So the relative performance of PCI and PCI-X devices could be very low in heavily configured systems.
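A toy model shows the shape of the problem. Everything here is our own illustration – the bus figure and the arbitration penalty are made-up numbers chosen only to show how a shared half-duplex bus degrades as cards are added, while point-to-point links do not:

```python
# Toy model of per-device bandwidth on a shared bus versus point-to-point links.
# The 5 percent arbitration penalty per extra card is an invented figure for
# illustration only, not a measured PCI or PCI-X characteristic.
def shared_bus_per_device(bus_gbytes, devices, contention_penalty=0.05):
    """Bandwidth each device sees on a shared half-duplex bus: the pipe is
    divided among all devices, and arbitration overhead grows with each card."""
    usable = bus_gbytes * (1 - contention_penalty * (devices - 1))
    return max(usable, 0.0) / devices

def point_to_point_per_device(link_gbytes):
    """On PCI-Express, every device gets its own full-bandwidth duplex link."""
    return link_gbytes

for n in (1, 2, 4, 8):
    print(f"{n} cards: shared bus gives each {shared_bus_per_device(1.0, n):.3f} GB/sec, "
          f"point-to-point gives each {point_to_point_per_device(1.0):.3f} GB/sec")
```

The shared-bus curve falls faster than a simple division by the card count because contention is subtracted before the split, which is the effect the paragraph above describes in heavily configured PCI and PCI-X systems.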
That 36-month performance doubling cadence is important because it is more or less what the outside edge of Moore’s Law is delivering for at least some processor vendors. Intel and IBM are at a three-year cadence at this point, nowhere near the annual doubling originally described by Intel co-founder Gordon Moore in his law-giving paper back in 1965, or even the amended law he put out in 1975 that stretched the doubling to 24 months or so. The fact is, the Moore’s Law steps are getting bigger; even if the drops are the same over the long haul, it is a much bumpier decline in the cost of semiconductors and rise in transistor densities. And it is, quite frankly, jarring.
Of course, Moore’s Law was not running as slowly as it is now back when the PCI family of buses was being introduced one and two decades ago, so this attempt to make the big delays in each subsequent update to the PCI-Express spec seem less bad is interesting and clever. But it does not change the fact that we sure could have used bus speed and bandwidth increases on a Moore’s Law stepping back when the steps were shorter and the strides less lengthy. The good news is that the pace of the PCI-Express stepping is picking up just as the pace is slowing down for CPUs and speeding up for networks. So there is a chance that some of the components in a system can be brought back into balance, and true hybrid architectures can flourish in the coming years because they won’t need proprietary interconnects to make components talk fast to each other.
Take a look at this chart we ginned up to make the point:
This shows relative server performance since 1998, when the PCI-X protocol came out, plotted against the top-end Ethernet switch port bandwidth at more or less the same time, and against the practical top end of the PCI bus bandwidth at around the same time. We can argue about some wiggle in the dates and bandwidths here and there, and we can also argue about whether we should include the full duplex bandwidth (both directions added together) instead of the maximum bandwidth going one way. We could also argue that we should be looking at aggregate PCI bandwidth per server, not the bus or x16 bandwidth. If we had that data, we would have plotted it, too.
Some people like to see log charts so we can see the details down on the lower left of the graph, so here that is at least:
That relative server oomph figure is a very rough one we have ginned up based on two-socket servers using X86 server chips, since the rest of the market is essentially noise in the data, and it takes into account increases in instructions per clock (IPC) as well as cores per socket to come up with the comparative performance over time.
The last time all three of these vectors were rising steeply at the same time was in 2010, and after that, PCI-Express and Ethernet both took a breather at their respective PCI-Express 3.0 and 40 Gb/sec levels and coasted for a while. Processors are still on a more or less annual cadence, with IPC and process tweaks, even if they take 18 months, 24 months, or now closer to 36 months to double the transistor counts and halve the cost per transistor. We are perhaps optimistic about how Ethernet is going to jump from 100 Gb/sec to 200 Gb/sec to 400 Gb/sec to 800 Gb/sec in the coming years, but this is what the roadmaps look like from the major merchant chip suppliers and their switch customers.
It doesn’t look like PCI-Express is ever going to catch up with the pace of CPU performance increases, but at least in the coming years it is going to be keeping pace.