The long overdue upgrade to PCI-Express 4.0 is finally coming to servers, allowing for high bandwidth links between processors and peripherals. But perhaps the most exciting use of this new bandwidth will be with PCI-Express switched fabrics linking aggregations of coprocessors like GPUs and FPGAs and storage like blocks of flash to compute complexes in a composable way.
This is the moment that GigaIO Networks, which we profiled back in October 2019, has been waiting for. The PCI-Express 3.0 switches it was selling last year running its FabreX composable fabric software were interesting, but the possibilities for larger aggregations of devices or higher bandwidth links between devices (or a mix of both) make the concept of PCI-Express disaggregation and composability all the more attractive to customers who are looking to not leave any capacity stranded in their systems. GigaIO was hinting about the possibilities of PCI-Express 4.0 switching during our HPC Day event in November last year.
The PCI-Express 4.0 switching silicon from Broadcom (formerly PLX Technologies) and Microchip is coming out of the foundries now, and GigaIO is plunking it down into its FabreX Switch and server adapter cards to goose their oomph. Now, we get to see how companies will make use of the technology that GigaIO has created. The FabreX FX/OS operating system and composability software just sees a faster switch, which can be used to create twice as many ports (48 per switch) running at the same speed as the PCI-Express 3.0 switches, or the same number of ports running at twice the bandwidth. PCI-Express 3.0 uses 8 Gb/sec of signaling per lane, and PCI-Express 4.0 doubles that up to 16 Gb/sec per lane. The switches can aggregate the lanes to make ports of various speeds and obviously different levels of radix, as they say in the networking business. (The Broadcom chip is used for the server adapter, and the Microchip ASIC is being used in the switch.)
“This gives people the ability to really tune in the performance or the scale within a rack or a handful of racks of equipment,” Alan Benjamin, co-founder and president of GigaIO, tells The Next Platform. “And for customers already using FabreX, they can combine PCI-Express 3.0 switches with PCI-Express 4.0 switches and it all runs in the same FabreX software environment.”
One of the benefits of PCI-Express switching, and what makes it a good choice for composable infrastructure as well as coherent attachment of coprocessors (as Intel is doing with CXL and as IBM has done with CAPI on the Power architecture), is the low latency compared to other kinds of transports. But, as Benjamin cautions, mileage varies when it comes to latency, and it is not as simple as a cut-through measurement on a port-to-port hop across the switch.
“It really depends on how you use the switch,” explains Benjamin. “It is typically port to port if you are doing a cut-through measurement, and we can have the argument as to how useful that is. But the cut-through measurement, port-to-port on this PCI-Express switch silicon is 50 nanoseconds. You will see the PCI-Express switch vendors typically quote 125 nanoseconds to 150 nanoseconds through the switch to be conservative, which we find to be generally pretty much right. And then, you know, there is latency across whatever software you start adding on top of it. There is so much variation depending upon what people are trying to do, and so we now just say customers are going to get sub-microsecond latency. We can show cases where it is 350 nanoseconds or it is 500 nanoseconds or it is 750 nanoseconds, but unless your software is really bloated, it is going to be less than 1 microsecond.”
Like other switch makers, GigaIO is not doubling up the price as well as doubling up the bandwidth, and that is because customers always expect a lower cost per bit transferred with each technology iteration. In this case, the PCI-Express 4.0 FabreX switches cost about 30 percent more and deliver either 2X the ports, 2X the bandwidth per port, or some mix if customers need something in-between with custom hardware. This is consistent with the price increases we have seen with the merchant Ethernet silicon providers at the switch level, and for obvious reasons. If PCI-Express switched fabrics are going to take off, vendors like Broadcom and Microchip and their downstream switch partners like GigaIO have to pass through the savings to customers to spur adoption. There may be insatiable demand for bandwidth, but there is generally only a level or slightly increasing budget. Higher radix switches are in vogue because, for any given number of devices that need to be connected, fewer switches are required to interconnect everything. And it is not like we have PCI-Express 4.0 GPU accelerators yet, and of all the FPGAs out there, only Intel’s Stratix 10 DX device plugs into PCI-Express 4.0 ports. (We had been expecting Nvidia to launch its “Ampere” Tesla GPU accelerators this week, possibly with PCI-Express 4.0 variants as well as NVLink 3.0 connectivity, but that didn’t happen because the GPU Technology Conference in San Jose was canceled thanks to the coronavirus outbreak.)
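To put a number on that cost per bit argument, here is a minimal back-of-the-envelope sketch. It uses only the two figures quoted above (roughly 30 percent more money for 2X the bandwidth); the function name `relative_cost_per_bit` is ours, purely for illustration, and actual street pricing will vary.

```python
def relative_cost_per_bit(price_ratio: float, bandwidth_ratio: float) -> float:
    """Cost per bit of the new generation relative to the old one.

    price_ratio:     new price divided by old price (1.30 means 30 percent more)
    bandwidth_ratio: new bandwidth divided by old bandwidth (2.0 means doubled)
    """
    return price_ratio / bandwidth_ratio


# Figures from the article: about 30 percent more for 2X the bandwidth.
gen4_vs_gen3 = relative_cost_per_bit(price_ratio=1.30, bandwidth_ratio=2.0)

print(f"PCI-Express 4.0 cost per bit relative to 3.0: {gen4_vs_gen3:.2f}")
print(f"Reduction in cost per bit: {(1.0 - gen4_vs_gen3) * 100:.0f} percent")
```

Under those two assumptions, each bit moved across the PCI-Express 4.0 switch costs about 65 percent of what it did on the PCI-Express 3.0 switch.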
So the question is whether companies will just stick with PCI-Express 3.0 peripherals for now and make use of the higher radix in the FabreX switches to attach more devices.
“That’s my bet,” says Benjamin. “Most people are going to use it at 128 Gb/sec performance and hook up twelve devices. But, as soon as I say that, someone is going to want to use 256 Gb/sec performance and do half that. And for customers that are using 25 Gb/sec Ethernet for connectivity right now, running 24 ports at 64 Gb/sec out of a PCI-Express switch is going to be a big boost in performance. And, of course, you can always cascade switches to create larger topologies, and we have customers that are doing that, too.”
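The port counts and speeds Benjamin cites all fall out of simple lane arithmetic, which can be sketched as follows. Note that the 96-lane switch size is our own inference from the quoted configurations (24 ports at 64 Gb/sec with 16 Gb/sec per lane works out to 96 lanes) and is not a figure GigaIO states here. The numbers are raw signaling rates, ignoring encoding and protocol overhead, and `port_options` is a hypothetical helper name.

```python
GEN3_GBPS_PER_LANE = 8    # PCI-Express 3.0 signaling per lane, per the article
GEN4_GBPS_PER_LANE = 16   # PCI-Express 4.0 signaling per lane, per the article

# ASSUMPTION: 96 usable lanes per switch, inferred from the quoted
# configurations (24 ports x 64 Gb/sec / 16 Gb/sec per lane = 96 lanes).
ASSUMED_SWITCH_LANES = 96


def port_options(total_lanes, gbps_per_lane, widths=(2, 4, 8, 16)):
    """List the port count and raw per-port bandwidth for each port width."""
    options = []
    for width in widths:
        options.append({
            "lanes_per_port": width,
            "ports": total_lanes // width,
            "gbps_per_port": width * gbps_per_lane,
        })
    return options


for opt in port_options(ASSUMED_SWITCH_LANES, GEN4_GBPS_PER_LANE):
    print(f"x{opt['lanes_per_port']:<2} -> {opt['ports']:>2} ports "
          f"at {opt['gbps_per_port']} Gb/sec each")
```

With those assumptions, the x4, x8, and x16 rows reproduce the 24 ports at 64 Gb/sec, twelve devices at 128 Gb/sec, and six devices at 256 Gb/sec in the quote, and the x2 row gives 48 ports at 32 Gb/sec, which is the same raw speed as a PCI-Express 3.0 x4 port.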
There are, of course, competing composable fabrics, some implemented across Ethernet with the RoCE implementation of RDMA, some across InfiniBand, and a few that make use of PCI-Express interconnects, and there is a lot that needs to be sorted out when deciding which way to go.
“What it comes down to is the latency of the device that you are trying to compose,” says Benjamin. “If you start with rotating drives, the latency is a couple of hundred microseconds or more and it really doesn’t matter what network you use because the network is far faster than the device. If you go to a standard NVM-Express flash drive, the drive is going to be in the range of 30 microseconds to 50 microseconds to even 80 microseconds. If you have got a transport latency of 15 microseconds to 20 microseconds, you are still pretty good. So what we say is if you are trying to virtualize NVM-Express storage, then Ethernet does a pretty good job, and if you need higher performance, you pick switches that support RoCE. But if you go to some of the newer NVM-Express devices that have a 10 microsecond to 15 microsecond latency for the device, or 3D XPoint from Intel or Micron Technology and you are down to one or two microseconds, or GPUs and FPGAs where you are sub-microsecond, you can’t take 10 microseconds or 15 microseconds of transport latency. And this is where people start looking hard at PCI-Express fabrics. There is nothing that anyone can do with software to reduce hardware latency – you can screw it up, but you can’t make it better.”
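Benjamin's rule of thumb can be made concrete by asking what share of the total access time is spent in the transport rather than in the device. The sketch below picks one representative number from each range he quotes. Those picks, the 1 microsecond ceiling used for a PCI-Express fabric, and the helper name `transport_share` are our illustrative assumptions, not measurements.

```python
# Representative device latencies in microseconds, picked from the ranges
# quoted above (ASSUMPTION: one illustrative point per device class).
DEVICE_LATENCY_US = {
    "rotating drive": 200.0,
    "standard NVM-Express flash": 30.0,
    "newer NVM-Express flash": 10.0,
    "3D XPoint": 1.0,
}

# Transport latencies in microseconds (ASSUMPTION: the low end of the quoted
# 15 to 20 microsecond Ethernet range, and the "sub-microsecond" PCI-Express
# claim rounded up to a full microsecond).
TRANSPORT_LATENCY_US = {
    "Ethernet": 15.0,
    "PCI-Express fabric": 1.0,
}


def transport_share(device_us: float, transport_us: float) -> float:
    """Fraction of the total access latency that is spent in the transport."""
    return transport_us / (device_us + transport_us)


for device, device_us in DEVICE_LATENCY_US.items():
    for transport, transport_us in TRANSPORT_LATENCY_US.items():
        share = transport_share(device_us, transport_us)
        print(f"{device:<28} over {transport:<19}: "
              f"{share * 100:5.1f} percent of latency is transport")
```

With these picks, Ethernet accounts for only about 7 percent of the total for a rotating drive but well over 90 percent for a 3D XPoint device, which is the crossover Benjamin is describing.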
The big hyperscalers and cloud builders, according to Benjamin, are finding out exactly how much of a problem the latency of those giant Clos Ethernet networks is when they have to support disaggregated banks of GPUs and FPGAs. And it would not be surprising to see them build their own or buy PCI-Express fabrics.
Now, with PCI-Express 4.0 switch chips shipping, all eyes turn towards PCI-Express 5.0, which will double the bandwidth up again, and PCI-Express 6.0, which will double it one more time. Benjamin says he expects first silicon for PCI-Express 5.0 switches to start sampling in the first half of 2021, with general availability of the switches in late 2021 to early 2022. That’s not a huge gap, considering how long it takes server makers to implement each PCI-Express speed jump.
Huh, “some of the newer NVM-Express devices, that have a 10 nanosecond to 15 nanosecond latency” shouldn’t this be microseconds?
Yes, that’s correct. Brain fart. Thanks for catching.
a powerful aspect of it afaict, would be enhanced ability to split large tasks into streams using multiple different available resource pools, but tune the resources dynamically so the streams keep pace with each other – all streams complete concurrently e.g.