The dividing lines between system buses, system intraconnects, and system interconnects are getting more blurry all the time. And that is, oddly enough, going to turn out to be a good thing in the long run.
In the short run, this is a bit messy. There are a number of competing and complementary standards that span this middle ground between the processor and adjacent systems, many of which run atop the PCI-Express bus transport but which do more interesting things with it than just hanging storage or networking off the bus, such as doing some form of memory sharing across devices, usually through some sort of coherency mechanism. Others are coming up with their own electrical or optical signaling.
These include the Compute Express Link (CXL) from Intel, the Coherent Accelerator Processor Interface (CAPI) from IBM, the Cache Coherent Interconnect for Accelerators (CCIX) from Xilinx, and the Infinity Fabric from AMD. Other interconnects try to get around some of the limitations of bandwidth or latency inherent in the PCI-Express bus, such as the NVLink interconnect from Nvidia and the OpenCAPI interconnect from IBM. OpenCAPI, which is supported on Big Blue’s Power9 processors, relies on special SERDES communication units on the chip that run at 25 Gb/sec and that can support a variant of the CAPI protocol or the NVLink protocol to attach Power9s to Nvidia Tesla GPU accelerators that also support NVLink – and do so in a coherent fashion across these different devices. The Gen-Z interconnect from Hewlett Packard Enterprise links out from PCI-Express on servers to silicon photonics bridges and switches that hold out the promise of a memory centric – rather than compute centric – architecture for systems. It can be used to hook anything from DRAM to flash to accelerators into meshes with any manner of CPU.
At this point, all of these interconnects but Nvidia’s NVLink and AMD’s Infinity Fabric have an independent consortium driving their specifications, and more than a few hyperscalers and vendors participate in multiple consortia to keep a hand in all of the different games. At some point, these may resolve into a smaller set of transports and protocols that achieve the collective goals of these interconnects. They may compete to the death. But it sure doesn’t look like it, not with Steve Fields, chief engineer of Power Systems at IBM, who also spearheads OpenCAPI, and Gaurav Singh, corporate vice president at Xilinx, who spearheads CCIX, plus Dong Wei, standards architect at ARM Holdings, and Nathan Kalyanasundharam, senior fellow at AMD, being four of the five members of the board of the new CXL Consortium, which was launched this week. Alibaba, Cisco Systems, Dell EMC, Facebook, Google, Hewlett Packard Enterprise, Huawei Technology, and Microsoft all jumped on the CXL bandwagon early, and together, these companies represent a big portion of the systems ecosystem when gauged by capacity sold or bought. Significantly, Nvidia has also joined up even though it does not have a seat on the CXL board.
The only problem that we see initially with CXL, which was shown off in detail at the recent Hot Interconnects conference, is that it is tied to the PCI-Express 5.0 protocol, which is not yet available. PCI-Express 4.0, which came out in 2017, is still only available with two processors – IBM’s Power9 and AMD’s “Rome” Epyc 7002 – and while we are all excited that the PCI-Express 5.0 spec is coming out sometime this year and PCI-Express 6.0 is expected to be ratified in 2021, it has taken far too long to get these faster buses into new chips.
(Hmmm. It is a pity that the I/O is not all in a central hub in a chiplet architecture that could swap the I/O out without messing up the cores . . . Oh wait, it is already with AMD’s Rome and will very likely be with IBM’s Power10, which definitely supports PCI-Express 5.0 controllers and will almost certainly have a chiplet architecture. Intel itself doesn’t expect to get products out the door supporting PCI-Express 5.0 until 2021.)
System builders and system buyers want to be able to have some kind of fast links and coherence between CPUs and various kinds of accelerators and storage class memories – and they want it yesterday, which is how we ended up in this alphabet soup in the first place.
Stephen Van Doren, an Intel Fellow and director of processor interconnect architecture at the chip maker, walked the bitheads at Hot Interconnects through the CXL architecture and talked about many of its finer points, but said that even though CXL would be aligning with the 32 Gb/sec PCI-Express 5.0 protocol, which is double what PCI-Express 4.0 delivers, CXL would also be “a key driver for an aggressive timeline” to PCI-Express 6.0, which will double up the transfer rate one more time to 64 Gb/sec. That is eight times the bandwidth of the standard PCI-Express 3.0 link. We think that PCI-Express is still the main bottleneck in systems, and anything that can be done to deliver more bandwidth and more interesting connectivity over this bus is most welcome.
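As a quick sanity check on those numbers, the sketch below runs the back-of-the-envelope math for an x16 link across the last four PCI-Express generations. It is only a rough calculation: it assumes 128b/130b encoding throughout, which is right for the 3.0 through 5.0 generations but only approximate for 6.0, where PAM4 signaling and flit mode change the overhead.

```c
#include <stdio.h>

/* Back-of-the-envelope bandwidth for an x16 link, one direction, per
 * PCI-Express generation. Assumes 128b/130b encoding throughout, so the
 * PCIe 6.0 figure is approximate rather than exact. */
int main(void)
{
    const double lanes = 16.0;
    const double encoding = 128.0 / 130.0;
    const double gts[]  = { 8.0, 16.0, 32.0, 64.0 };   /* GT/s per lane */
    const char  *name[] = { "PCIe 3.0", "PCIe 4.0", "PCIe 5.0", "PCIe 6.0" };

    for (int i = 0; i < 4; i++) {
        double gbytes = gts[i] * encoding * lanes / 8.0;  /* GB/sec, one way */
        printf("%s x16: ~%.0f GB/sec (%.0fx the PCIe 3.0 signaling rate)\n",
               name[i], gbytes, gts[i] / gts[0]);
    }
    return 0;
}
```

Run it and the x16 numbers come out at roughly 16 GB/sec for PCI-Express 3.0, doubling each generation to roughly 126 GB/sec per direction for PCI-Express 6.0.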
As we said above, CXL is Intel’s own twist on extending PCI-Express so that it supports both I/O and memory disaggregation – a kind of Holy Grail for system architects that essentially virtualizes motherboards to make compute, memory, and I/O malleable across clusters of components – as well as computational offload to GPU and FPGA accelerators, memory buffers, and other kinds of devices such as SmartNICs, which are computers in their own right. CXL is a set of sub-protocols that ride on a single PCI-Express link and give it some new tricks. Take a look:
CXL.io is the easy one, and it is basically the PCI-Express transaction layer, reformatted to allow the other two sub-protocols to co-exist side by side with it. CXL.io is used to discover devices in systems, manage interrupts, give access to registers, handle initialization, deal with signaling errors, and such. The CXL.cache sub-protocol allows an accelerator plugged into a system to access the CPU’s DRAM, and CXL.memory allows the CPU to access the memory (whatever kind it is) in an accelerator (whatever kind of processing engine it is).
“These three protocols are not necessarily required to be used in all configs,” explained Van Doren. “In fact, protocol subsetting is expected as part of the CXL ecosystem. And there’s basically three usage templates that track the relevant subset of usages that we expect to see.”
The first subset is called a Type 1 device in the CXL nomenclature, and it is for devices that want to cache data from the CPU main memory locally. In this case, the devices only have to employ the CXL.io and CXL.cache layers. With a Type 2 device, there is memory on the accelerator and you want an interplay between the CPU and the accelerator, so the CXL.io protocol is used to allow the CPU to discover the device and configure it, and then you use CXL.cache to allow the device to touch the processor’s memory and CXL.memory to allow the processor to reach into the device’s memory. The Type 3 device is a memory buffer, and in this case you need the CXL.io sub-protocol to discover and configure the device and the CXL.memory sub-protocol to allow the CPU to reach into the memory attached to your memory buffer. It is interesting to contemplate just how much memory you could hang off that CXL link on the right in the picture above – and what kind of memory it might be and how fast a PCI-Express 6.0 link might have to be to support it.
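To make the subsetting concrete, here is a minimal sketch of the three device profiles and the sub-protocols each one carries. The bitmask encoding and the names of the C identifiers are invented for illustration; only the Type 1/2/3 mapping itself comes from the CXL nomenclature described above.

```c
#include <stdio.h>

/* Which CXL sub-protocols each device type carries. The encoding here is
 * invented for illustration; the mapping follows the Type 1/2/3 scheme. */
enum cxl_subprotocol {
    CXL_IO     = 1 << 0,   /* discovery, config, interrupts, errors    */
    CXL_CACHE  = 1 << 1,   /* device caches lines out of host DRAM     */
    CXL_MEMORY = 1 << 2,   /* host reaches into device-attached memory */
};

struct cxl_device_profile {
    const char *type;
    unsigned    protocols;
};

static const struct cxl_device_profile profiles[] = {
    { "Type 1: caching accelerator, no local memory", CXL_IO | CXL_CACHE },
    { "Type 2: accelerator with its own memory",      CXL_IO | CXL_CACHE | CXL_MEMORY },
    { "Type 3: memory buffer",                        CXL_IO | CXL_MEMORY },
};

int main(void)
{
    for (unsigned i = 0; i < sizeof(profiles) / sizeof(profiles[0]); i++)
        printf("%-48s io=%d cache=%d memory=%d\n", profiles[i].type,
               !!(profiles[i].protocols & CXL_IO),
               !!(profiles[i].protocols & CXL_CACHE),
               !!(profiles[i].protocols & CXL_MEMORY));
    return 0;
}
```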
Drilling down deeper than the sub-protocols to the link layers underneath CXL, this is where it is really different from the PCI-Express protocol – and obviously intentionally so.
There are tradeoffs that have to be made in designing a link layer, explained Van Doren, and CXL is no exception.
“For the CXL.io protocol, which is very PCIe-like, we actually do run it through something that looks very much like the standard PCIe link layer,” said Van Doren. “But for the CXL.cache and the CXL.memory protocols, we actually have a very different link layer and we have an interface stack that does the multiplexing of the protocols further down, closer to the PHY. These two different types of link layers are differentiated based on whether you have fixed framing or not. PCIe has dynamic framing, which is very useful when you want to send messages of widely varying size – anything from 8 byte transactions to 4 KB transactions. When you’re looking at CXL.cache and CXL.memory, you are in a cache coherent and memory semantic environment where all of the transfers are 64 byte cache line sizes. You have some control messages that go along with that, but the dynamic range of message sizes is quite constrained compared to what you see on PCIe. And this creates an opportunity to build a link layer based on fixed framing, which can save a lot of latency.”
Van Doren said that the consortium will have a lot of partners with many different kinds of devices, and latencies will vary according to design and device type, but based on early designs, fixed framing for the CXL.cache and CXL.memory sub-protocols delivered an order of magnitude – 100 nanoseconds – lower latency. “So even though CXL adds a little bit more complexity to the interface stack, we think the savings in latency are worth the investment and complexity.”
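One way to picture that framing tradeoff is with two hypothetical frame layouts, sketched below. These are not the real PCI-Express or CXL wire formats – the field names and sizes are invented – but they show why a receiver that always gets a 64 byte cache line has far less parsing to do than one that must read a length field before it knows where a packet ends.

```c
#include <stdint.h>

/* Dynamic framing, PCIe-style: the link layer cannot place the payload
 * boundary until it has parsed the length field, which adds buffering
 * and parsing stages (and latency) on every transfer. */
struct dynamic_frame {
    uint16_t length;        /* payload size, anywhere from 8 bytes to 4 KB */
    uint8_t  header[14];    /* routing, tags, attributes                   */
    uint8_t  payload[];     /* variable length, known only at runtime      */
};

/* Fixed framing, in the spirit of CXL.cache and CXL.memory: every data
 * transfer is one 64 byte cache line, so frame boundaries are known in
 * advance and the link layer can be pipelined with minimal parsing. */
struct fixed_frame {
    uint8_t  control[8];    /* message type, tags, credits                 */
    uint8_t  payload[64];   /* always exactly one cache line               */
};
```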
The one thing that everyone wants to know about these new in-system interconnects is how they are going to deal with cache coherency and allow for memory to be shared across devices. There was a lot of talk a few years back, when IBM’s CAPI and Nvidia’s NVLink were in development and rolling out, that Intel would open up its QuickPath Interconnect (QPI) or its follow-on, the Ultra-Path Interconnect (UPI), which is used to provide NUMA links between Xeon processors so they can share cache and main memory and present a shared memory space to operating systems.
“This is one of the aspects of CXL that I think we get the most questions about,” Van Doren said. “CXL has a cache coherency protocol that doesn’t look very much like the multi-socket cache coherency protocols that most folks are used to seeing. It doesn’t look like the one Intel does, the UPI multi-socket protocol, and it doesn’t look like those of our competitors, either.”
The approach that Intel has taken is asymmetric, and this is not the first system architecture to do this, but it is the first one in a long time to do it. (Not to remind everyone, but IBM’s AS/400 systems had asymmetric multiprocessing decades ago, simply because the CPU was too precious of a commodity to burden with I/O and storage tasks. What’s old is new again.)
With symmetric cache coherency protocols, which typically link the memories of separate CPUs to each other but which can also be used to link memories of accelerators to each other (NVLink can do this across GPU accelerators as well as IBM Power9 processors), the compute engines are front-ended by a protocol caching agent and the memories are front-ended by a protocol home agent, and the interconnect – UPI, Infinity Fabric, Bluelink, whatever – connects them all together using a low-latency, high bandwidth transport layer, as illustrated on the left of the chart below:
In the case of an accelerator using the same cache coherency interconnect as the CPUs, the accelerator looks like any other CPU and its memory looks like any other memory block. But there is a drawback, said Van Doren, and that is that the cache coherency is going to be bottlenecked by the bandwidth between the CPU and the accelerator. Which means if you are not going to literally put UPI on the accelerator, and use something like PCI-Express instead as a transport, that much slower interconnect is going to be a real bottleneck. Moreover, every server processor maker has its own NUMA interconnect and it is highly unlikely that they will all agree to adopt one of them as a standard. (Although that would be very convenient, especially if we had socket compatibility across processors, too. Imagine how wonderful that would be. . . . )
So CXL gives up on strict CPU-style cache coherency with accelerators and uses an approach called biased coherency bypass, which is shown on the right side of the chart above. And in fact, there are two different biases, which are outlined here:
The accelerator coherence is created with two complete – and completely different – flows, and importantly software can be used to flip back and forth between these two modes, which are called host bias and device bias, respectively. The idea is to get the benefits of cache coherency out to an offload engine without having to pay some of the high prices for full-on cache coherency as implemented between CPUs.
“When you choose the host bias coherency protocol, everything from the accelerator literally has to get bounced through the CPU,” Van Doren explained. “The ordering point for even the memory lines on the accelerator is in that cache coherence agent inside the CPU. With the second flow, which we call device bias, the intuitive way to understand the way it works is that this flow essentially forces the CPU to interact with a memory as though it’s uncacheable memory. So the CPU can grab a copy of the data, but it’s not allowed to hold it in its caches for an extended period of time. What that means is that when my accelerator goes to access this memory, it is guaranteed that the CPU doesn’t have a copy. So there is no reason to go over to the CPU and check.”
Importantly, this is all going to be determined at the driver level, below the application and alongside and in conjunction with the operating system. And it can happen on a page by page basis managed by that driver. Moreover, because of the asymmetric nature of the cache coherency, a CXL-enabled accelerator will not care one whit what coherency protocol is being used across the memories attached to the CPUs in the system. So with Intel CPUs, it can make use of UPI, with AMD CPUs, it can make use of Infinity Fabric, with Arm CPUs it can make use of CCIX, and with IBM CPUs it can make use of Bluelink. The other possible approach was to create some kind of universal symmetric protocol, or to try to get the symmetric protocols to bridge to each other. But that would be tough, and it would also yield least common denominator compatibility with each individual symmetric cache coherency protocol, and we all know how useful that would be in practice.
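A rough sense of what that driver-level, page-granular bias management might look like is sketched below. Every function name here is hypothetical – a stand-in for whatever interface a vendor’s accelerator driver actually exposes – but the shape of the flow follows the host bias and device bias modes described above.

```c
#include <stddef.h>

/* Hypothetical bias states and driver hooks, invented for illustration. */
enum cxl_bias { HOST_BIAS, DEVICE_BIAS };

void accel_set_bias(void *pages, size_t len, enum cxl_bias bias); /* flip bias on a page range */
void accel_launch(void *in, void *out, size_t len);               /* kick off the offload job   */
void accel_wait(void);                                            /* wait for it to complete    */

void offload_job(void *in_buf, void *out_buf, size_t len)
{
    /* Put the working set in device bias: the accelerator can now hit its
     * own memory without checking whether the CPU caches hold a copy. */
    accel_set_bias(in_buf,  len, DEVICE_BIAS);
    accel_set_bias(out_buf, len, DEVICE_BIAS);

    accel_launch(in_buf, out_buf, len);
    accel_wait();

    /* Flip the results back to host bias so the CPU can cache and consume
     * them through the ordinary coherent path. */
    accel_set_bias(out_buf, len, HOST_BIAS);
}
```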
“The asymmetry was actually built in by design to try to make this more of an open ecosystem,” Van Doren elaborated. “And I think this is truth in advertising: Intel started down this direction because we wanted an ecosystem that was amenable to both our client and server CPUs. So this isn’t just a question of different vendors. It turns out if you want an open ecosystem where you can build an accelerator and plug it into both the server CPU and the client CPU without putting a whole bunch of extra gunk on the accelerator that the client doesn’t need, you need an interconnect that allows the server to have the server coherence engine, which is very big and scalable, and our client CPU gets to have a very lightweight, non-scalable coherence engine.”
This asymmetric approach is important for another reason: It is going to make memory disaggregation a lot easier. With a symmetric approach, any adjunct memory buffer in a system would need to have its own home agent and snoop filter to bring it into the shared memory space of the collection of CPUs in the system. But using CXL and its asymmetric approach, any memory buffer can make use of the home agents and snoop filters on the processors and does not have to have these electronics added in. As Van Doren put it, the symmetry needlessly complicates a device class – memory buffers – that should be simple to deploy.
And that brings us to the offload model and why Intel thinks we need cache coherence at all. With the exception of the IBM Power9 processors and the Nvidia Tesla V100 GPU accelerators, which have NVLink ports linking them all into a shared memory space, we don’t really have coherence between CPUs and accelerators, but Intel thinks that we need it, and CXL is about delivering it.
There has been an evolution in how the memories of CPUs and accelerators interact, as shown below:
In the beginning, which Van Doren calls the split physical address with I/O link, the accelerator was on the PCI-Express bus but there were two distinct virtual address spaces – one for the CPUs and one for each individual accelerator.
“If you had any data structures that were pointer-based, there was a lot of complexity and you would have to do data marshalling, which means take your pointer-based data structure, smash it down to an array-based data structure, copy it over, re-expand it. For every application, your developers would have to write these things. You could get some efficient data copies – if you looked at your efficiency on the wire, the bus would look pretty good – but application development was a really big pain in the butt.”
And so the industry created unified addressing or shared virtual memory, depending on the naming convention accelerator makers employed, allowing for virtual addressing to be shared across CPUs and accelerators. This made certain things easier, but you still have two distinct physical address domains and it is not one big pool of writeback memory for applications to play in. And while the complexity was hidden by doing a lot of copying data back and forth between CPU memory and accelerator memory, there was a lot of memory management because access was driven by page faulting.
The obvious fix, according to Intel, is to simply let the CPU access pages of memory in the accelerators, and for that you need coherency. And with asymmetric coherency, fine-grained memory control as is done between CPUs in a system is not the goal, but direct memory access is. Van Doren says that cache coherency protocols have a “bad habit” of leaving data in or pulling it to the wrong place, such as leaving it in the CPU cache instead of the accelerator memory where it will probably be more useful. The coherence bias added into the CXL protocol makes use of the fact that memory usage on accelerators – be they memory buffers or computational offload engines – is pretty well defined, and it can simply keep data in the devices where it will be used next, saving CPU cycles that might otherwise be used to do the pushing. There is a bias flip and the data is moved from the CPU cache to the accelerator memory in one fell swoop. This approach, says Van Doren, is much more efficient than migrating pages back and forth as is now done.
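The difference between the old marshalling model and the coherent, bias-flipped model is easier to see side by side. The sketch below uses the same hypothetical driver calls as before, plus invented marshalling and copy helpers; none of it is a real API, it just traces the two flows Van Doren described.

```c
#include <stddef.h>

struct node { int value; struct node *next; };

/* All of these are hypothetical stand-ins for an offload runtime. */
int  *marshal_to_array(const struct node *list, size_t count);
void  unmarshal_from_array(struct node *list, const int *flat, size_t count);
void *dev_alloc(size_t len);
void  dev_copy_to(void *dst, const void *src, size_t len);
void  dev_copy_from(void *dst, const void *src, size_t len);
void  dev_run_kernel(void *buf, size_t count);

enum cxl_bias { HOST_BIAS, DEVICE_BIAS };
void  accel_set_bias(void *pages, size_t len, enum cxl_bias bias);

/* Split address spaces: flatten the pointer-based structure, copy it over,
 * run the kernel, copy it back, and re-expand it. */
void offload_split(struct node *list, size_t count)
{
    int  *flat = marshal_to_array(list, count);
    void *dbuf = dev_alloc(count * sizeof(int));

    dev_copy_to(dbuf, flat, count * sizeof(int));
    dev_run_kernel(dbuf, count);
    dev_copy_from(flat, dbuf, count * sizeof(int));
    unmarshal_from_array(list, flat, count);
    /* cleanup of flat and dbuf elided for brevity */
}

/* Coherent, biased model sketched for CXL: the accelerator follows the same
 * pointers the CPU uses, and one bias flip moves the working set instead of
 * faulting pages back and forth. */
void offload_coherent(struct node *list, size_t count)
{
    accel_set_bias(list, count * sizeof(*list), DEVICE_BIAS);
    dev_run_kernel(list, count);
    accel_set_bias(list, count * sizeof(*list), HOST_BIAS);
}
```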
All of this explains why all of the big players are joining the CXL Consortium, even Intel’s biggest processing and accelerator rivals. Everybody just wants all of this iron to get along.