
The System Bottleneck Shifts To PCI-Express

No matter what, system architects are always going to have to contend with at least one bottleneck – and possibly more – when they design the machines that store and crunch the data that makes the world go around. These days, there is plenty of compute at their disposal, a reasonable amount of main memory to hang off of it, and both Ethernet and InfiniBand are on the cusp of 200 Gb/sec of performance and not too far away from 400 Gb/sec and even higher bandwidths.

Now, it looks like the peripheral bus based on the PCI-Express protocol is becoming the bottleneck, even with advances such as NVM-Express. Or, perhaps more precisely, the PCI-Express bus is getting overloaded because of such technologies as NVM-Express, which links flash (and soon other non-volatile, persistent storage like 3D XPoint) more directly to the CPU and main memory complex, and because of the attachment of GPU and sometimes FPGA accelerators that offload massively parallel or specialized functions from the CPU.

It is a bit like playing Whack-A-Mole. But the vendors that comprise the PCI-SIG organization that creates and commercializes the PCI-Express bus are not just sitting there as Moore’s Law advances compute, storage, and system interconnect networking at an aggressive pace. The PCI-Express 4.0 protocol hit its Revision 0.9 specification in June of this year, and will be first deployed by IBM in its Power9 processor this year. Neither the new “Skylake” Xeon SP processors from Intel nor the new “Naples” Epyc processors from AMD, both launched within the past month, have support for PCI-Express 4.0, which doubles up the bandwidth over PCI-Express 3.0. But there will no doubt be kickers, probably available next year, that will offer the updated PCI-Express protocol.

Equally importantly, the members of the PCI-SIG are working to double up the performance of the PCI bus again with PCI-Express 5.0, which is slated to have its specification completed by 2019. In fact, back in early June, when the PCI-Express 4.0 protocol spec running at 16 GT/sec was put out in “feature complete” form and was awaiting final approval from vendors to get the Revision 1.0 nod to go to market, the specification for PCI-Express 5.0, which runs at 32 GT/sec, was put out at the Revision 0.3 level and is progressing.

This quadrupling of bandwidth over PCI-Express 3.0, which has a raw bit rate of 8 GT/sec, is necessary so that single-port 400 Gb/sec Ethernet and InfiniBand adapter interfaces can be driven by a single PCI-Express x8 port rather than requiring a jump to x16 ports for networking. GPU and FPGA accelerators need the added bandwidth of PCI-Express 4.0 and 5.0 across the lanes in an x16 port as they cram an incredible amount of compute and, we presume, larger chunks of HBM memory onto their packages. And, provided that copper reach does not shrink too much as signaling keeps getting faster, the more capacious PCI-Express buses of the future will enable much tighter coupling of compute nodes using raw PCI-Express switching, which presents some interesting possibilities and, in some cases, obviates the need for Ethernet or InfiniBand as a cross-system interconnect on distributed systems.

Just like Ethernet stalled in the jump from 10 Gb/sec to 100 Gb/sec and had to take a ministep at 40 Gb/sec, the peripheral bus hit a region of low air pressure in the jump from PCI-Express 3.0 to PCI-Express 4.0. The flat spot in the exponential curve is pretty obvious in the PCI-Express roadmap:

For those of you who like the numbers, here is the evolution from PCI 1.0 through PCI-Express 5.0 in terms of bandwidth and the frequency of the traffic lanes on the bus:
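For the PCI-Express generations, those numbers can be regenerated from the publicly documented per-lane signaling rates and encodings. Here is a minimal Python sketch covering just PCI-Express 1.0 through 5.0 (the parallel PCI and PCI-X rows of the original table are left out):

```python
# Per-lane signaling rates and line encodings for the PCI-Express generations,
# taken from the published specification values.
GENERATIONS = {
    # generation: (raw rate in GT/sec, payload bits, encoded bits)
    "PCI-Express 1.0": (2.5, 8, 10),     # 8b/10b encoding
    "PCI-Express 2.0": (5.0, 8, 10),     # 8b/10b encoding
    "PCI-Express 3.0": (8.0, 128, 130),  # 128b/130b encoding
    "PCI-Express 4.0": (16.0, 128, 130),
    "PCI-Express 5.0": (32.0, 128, 130),
}

for gen, (rate_gt, payload, encoded) in GENERATIONS.items():
    # Usable bandwidth per lane, per direction, in GB/sec.
    lane_gb = rate_gt * payload / encoded / 8
    print(f"{gen}: {rate_gt:>4} GT/sec per lane, "
          f"{lane_gb:.2f} GB/sec per lane, "
          f"{lane_gb * 16:.1f} GB/sec per direction on an x16 slot")
```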

The reason that PCI-Express 4.0 took so long – almost twice as long as usual and longer than any PCI bus jump – is simple: It is really hard to keep driving up capacity on all devices, and no semiconductor process, be it for compute, networking, or storage, can double forever at a lower price per unit of capacity without hitting at least a few snags.

“It gets harder and harder to do each iteration,” Al Yanes, president of the PCI-SIG, explains to The Next Platform. “If you go back to what we said four or five years ago, we have learned a lot, and people have developed different schemes and different ways of doing the PHY. They are using different materials, such as Megtron4 and Megtron6, on the printed circuit board, and are using different connectors. They have also come up with new ways to reduce crosstalk and electrical discontinuities, and with reducing the margining and putting this all together is what makes us fairly confident that we can do PCI-Express 5.0 at 32 GT/sec. People squeeze here and there and innovate.”

The connectors for PCI-Express 5.0 will be a little bit tighter and will have shorter keys and use different materials, but importantly will be backwards compatible with prior PCI-Express 3.0 and 4.0 engineering. That means manufacturers don’t have to change their server motherboard, switch motherboard, or adapter card designs even as they shift to different materials like Megtron4 and Megtron6, which are more expensive but which reduce signal losses, and devices with different tolerances.

The compatibility of PCI-Express 4.0 and 5.0 will be particularly important, since it is fair to say that PCI-Express 4.0 will probably have a short life in the field. If IBM rolls out Power9 sometime before the end of 2017, it won’t be shipping it in volume until early 2018. We don’t expect a lot of PCI-Express 4.0 devices to come to market until Intel’s Xeon SP line supports it. We already know that the “Kaby Lake” Xeon cores do not have PCI-Express 4.0 controllers. One could be retrofitted into the Kaby Lake variants of the Xeons when they come out around July 2018 or so, but it might take until the “Cannonlake” Xeons come to market, in maybe May 2019, before we see PCI-Express 4.0. A lot depends on how much hay IBM can make with its PCI-Express 4.0 lead.

Generally speaking, there has been a lag of about a year between a PCI spec being finished and PCI products coming to market. The PCI-Express 4.0 spec took a long time to develop and is really only now being completed, but products have been designed as the spec got closer to completion, so IBM won’t be launching Power9 without any peripherals that use the faster protocol. With PCI-Express 5.0, the spec should be done in 2019 and the products should hit the streets in 2020. Yanes says that the move to PCI-Express 5.0 is going to be a bit faster than usual because the PHY is going to be tweaked and put out the door with minimal changes to the protocol, just to get the speed out there.

As for PCI-Express 5.0, which will be using more expensive materials, Yanes says that server makers and their customers will be willing to pay a premium for a few slots on the server, perhaps to drive networking or GPU and FPGA accelerators. If we were designing server chips, the lesson here might be to have a PCI-Express 4.0 controller on the chip, but even in 2020 and 2021, maybe it only makes sense to have PCI-Express 5.0 on the package, like Intel is doing with its Omni-Path interconnect on selected Skylake Xeons. We could end up with systems that not only have backwards compatibility, but that literally support two types of controllers in the same system at the same time.

Then again, with non-volatile memory and fast networking becoming vital to distributed systems performance, the industry could just try to jump straight to PCI-Express 5.0 in 2020 with tepid support for PCI-Express 4.0. We think this is not likely. The bandwidth demands are too high, to put it bluntly, and in fact, if PCI-Express 4.0 had not been delayed so much, it would not have been necessary to create technologies like NVLink in the first place. Nvidia could have just put PCI-Express 4.0 ports and switching and a coherent protocol on the “Pascal” and “Volta” GPUs and been done with it. The same PHYs that do the 25 Gb/sec “Bluelink” OpenCAPI ports on the Power9 chips are being used to implement NVLink 2.0 and PCI-Express 4.0, after all. In a way, all of the ports on Power9 are PCI-Express 4.0 running different protocol layers on top.

PCI-Express 5.0 uses the same 128b/130b data encoding scheme that was delivered with PCI-Express 3.0 and 4.0, and with only about 1.5 percent overhead it is much more efficient than the 8b/10b encoding used with PCI-Express 2.0. With the faster PHYs and the skinny encoding, a 400 Gb/sec Ethernet or InfiniBand port, or a dual-port Ethernet or InfiniBand device running at 200 Gb/sec per port, will require 50 GB/sec of bandwidth in both directions on the wire. With PCI-Express 4.0, it takes 16 lanes of traffic (an x16 slot) to deliver around 64 GB/sec, which is enough to handle that 50 GB/sec of bi-directional traffic. But when PCI-Express 5.0 doubles that up again, the same job can be done with an x8 slot, which means the form factors will stay as they are now for 100 Gb/sec Ethernet and InfiniBand.
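To make that arithmetic concrete, here is a minimal Python sketch of the encoding overheads and slot bandwidths in play. It prints both per-direction and combined send-plus-receive figures, since the comparison above weighs the 50 GB/sec number against the combined bandwidth of a slot:

```python
# Line encoding overhead: the fraction of raw bits eaten by the encoding.
overhead_8b10b    = 1 - 8 / 10     # 20 percent, used by PCI-Express 1.0 and 2.0
overhead_128b130b = 1 - 128 / 130  # roughly 1.5 percent, PCI-Express 3.0 onward
print(f"8b/10b overhead:    {overhead_8b10b:.1%}")
print(f"128b/130b overhead: {overhead_128b130b:.1%}")

# A single 400 Gb/sec Ethernet or InfiniBand port, expressed in GB/sec.
port_gbytes = 400 / 8  # 50 GB/sec

def slot_bandwidth(rate_gt, lanes):
    """Usable GB/sec per direction for a slot using 128b/130b encoding."""
    return rate_gt * (128 / 130) / 8 * lanes

for label, rate_gt, lanes in [("PCI-Express 4.0 x16", 16.0, 16),
                              ("PCI-Express 5.0 x8",  32.0,  8),
                              ("PCI-Express 5.0 x16", 32.0, 16)]:
    per_direction = slot_bandwidth(rate_gt, lanes)
    print(f"{label}: {per_direction:.1f} GB/sec per direction, "
          f"{2 * per_direction:.1f} GB/sec send plus receive, "
          f"versus the {port_gbytes:.0f} GB/sec the port needs")
```

The “around 64 GB/sec” figure cited above is the raw bidirectional rate of a PCI-Express 4.0 x16 slot (16 GT/sec times 16 lanes times two directions, divided by 8 bits per byte) before the encoding overhead is taken out.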

If you look at the 400 Gb/sec Ethernet products coming from Innovium and Mellanox Technologies, it would be nice if PCI-Express 5.0 were here next year instead of two years from now. Switching is getting back in sync with compute, but the PCI bus is lagging, and that is not a good thing considering how many fast-moving devices now hang off of it.
