It is refreshing to find instances in the IT sector where competing groups with their own agendas work together for the common good and the improvement of systems everywhere. So it is with the absorption of the Gen-Z Consortium by the CXL Consortium.
With this, CXL has emerged as the protocol of choice for linking CPU memory to the memory of accelerators and, interestingly enough, as the protocol that will be used to link any kind of compute engine to external memory resources. (Think about it: A memory bank can be thought of as an accelerator that just has some kind of dynamic or persistent memory on it and just a memory controller instead of a full-blown CPU, GPU, FPGA, or custom ASIC.)
What could have been a reprise of the Bus Wars of the late 1980s and the early 1990s, and what could have resulted in a hodge-podge alphabet soup of competing transports and protocols for providing coherent memory access between devices, has instead, in a relatively short time of only two years, become a single protocol that the entire industry can get behind. And we think chip and system makers who are not particularly enthusiastic about CXL will have to get behind this standard, particularly now that Gen-Z is merging into the CXL Consortium. All of Gen-Z’s intellectual property is now controlled by the CXL Consortium, and all of its people will be working from one place on a common platform for both in-chassis interconnects, as CXL was formed to do, and across-chassis intraconnects, as Gen-Z was relegated to by the emergence and market acceptance of CXL.
Intel’s Compute Express Link, or CXL for short, was late to the interconnect protocol party, having been announced in March 2019, several years after IBM’s CAPI and OpenCAPI, Nvidia’s NVLink and NVSwitch, AMD’s Infinity Fabric, Xilinx’s Coherent Cache Interconnect for Accelerators (CCIX), and Hewlett Packard Enterprise’s Gen-Z (embodying some of the ideas behind The Machine). But being late was fine because in the ensuing years the PCI-Express transport got onto a two-year performance cadence of bandwidth doubling and got out of the stalled condition of seven years between generations 3.0 and 4.0. That gave the CXL protocol, which is without question better than the PCI-Express protocol, a transport it could ride atop for the next few years until something from the silicon photonics world evolves and slides invisibly underneath PCI-Express and replaces it. Many of these other protocols will be relegated to CPU-to-CPU interconnects at best, and we think in the long run it would be interesting to see CXL emerge as a compatible CPU coherence protocol that would allow a standard across CPU architectures; it would be interesting to see CXL memory links that are akin to what IBM is doing with the OpenCAPI Memory Interface (OMI) with its Power10 processor. (We talked about these possibilities two months ago in The CXL Roadmap Opens Up The Memory Hierarchy.) The performance of CXL on PCI-Express 5.0 is there for this, as we talked about here.
That is a lot to ask of CXL in the short run, but you don’t get anything if you don’t ask for it. The momentum in the market for a coherent interconnect is clearly behind CXL, which has 15 board member companies and 43 contributing member companies, for a total of 58 members, and a total of 108 adopter companies. While the Gen-Z Consortium had 65 members, the CXL Consortium has a lot of adopters who are eager to see the PCI-Express 5.0 ports that CXL needs proliferate on CPUs and accelerators in the next year. IBM’s Power9 processor was the first to support OpenCAPI 2.0, NVLink 2.0, and PCI-Express 4.0 ports, and Big Blue’s Power10 processor is the first to support OpenCAPI 3.0 and PCI-Express 5.0, but interestingly does not support NVLink 3.0 (bad blood between Nvidia and IBM over the lost exascale supercomputer deals with the US Department of Energy is our guess) and can, in theory, support the CXL 1.0 and CXL 1.1 specs, which were about point-to-point interconnects between CPUs and accelerators. With the forthcoming CXL 2.0 spec – and here is where it gets really interesting for system architects – three kinds of devices will be able to be linked to hosts using a PCI-Express switched fabric and shared on that host or across multiple hosts: accelerators such as NICs and SmartNICs without much compute and no real memory aside from cache, accelerators with lots of compute and high bandwidth memory, and memory buffer devices with lots of memory and no real compute.
We will need a memory hypervisor to manage this, of course, and VMware is stretching its ESXi hypervisor to become one, the Linux community is working on one, and MemVerge has laid the groundwork for one as well.
To make this all work across so many use cases and so many different possible mixes of devices and memories, the industry has to get behind one protocol and push hard.
Kurtis Bowman, who had been involved with server engineering and architecture at Dell for over 16 years and who joined AMD as director of server system performance back in May, was often the public face of the Gen-Z Consortium and was also on its board. He reached out to Jim Pappas, director of technology initiatives at Intel and also chairman of the CXL Consortium, to merge the organizations.
“We have 80 percent common members, and having two organizations makes it very hard to make advancements on both fronts because we have to figure out where to put our investments,” Bowman told The Next Platform. “We have to figure out where to put our best people. We have to make sense of this for the strategy rooms and the boardrooms. And the bottom line of our conversation was that Gen-Z would cease to exist but all of the assets would transfer over to CXL and that gives CXL the opportunity to pick up all of the work that the Gen-Z members have done and use it as they deem appropriate. Because of the common membership between the two organizations, the Gen-Z members understand the CXL spec and they are building the Gen-Z pieces that they would like to see transferred over.”
“I’m the one who brought this idea back to the CXL Consortium board to discuss it, and no one had to twist my arm,” said Pappas. “There was not a lot of discussion, either. It is a pretty obvious thing to do and it was the right time.”
One of the very interesting things that Gen-Z was working on with the OpenFabrics Alliance (which controls the InfiniBand and Ethernet RDMA networking standard) and the Distributed Management Task Force (DMTF), said Bowman, was a memory manager across these diverse devices. “That’s not a trivial thing to do, and it is something that Gen-Z knew it needed to have done,” said Bowman, particularly since Gen-Z was being pushed as a server intraconnect across nodes, with the first big application being a remote memory server, which Dell demonstrated back in January 2020 and which more than a few hyperscalers and cloud builders were watching very closely. “As it expands out, CXL is going to need that. People will see that processors and accelerators support CXL and now there will be this memory fabric manager coming out of OFA, open source with the code on GitHub and can be used by anyone who wants to create a memory manager.”
And that memory manager, we think, can possibly be the heart of a new distributed memory hypervisor.
CXL and Gen-Z have been carving up the interconnect and intraconnect in the datacenter for the past couple of years, with Pappas declaring a truce in January 2020 and then the two organizations inking a memorandum of understanding about how they could work together in April 2020. At that time, we suspected that the silicon photonics chipsets created by HPE would be the Cadillac version of an intraconnect out to a memory SAN or memory server or memory pool – whatever you want to call it – and maybe even to linking the memories of servers directly to each other over a cluster rather than going through an InfiniBand or Ethernet networking stack. CXL could be the mount point for the Gen-Z silicon photonics, was the thinking, since CPUs would have CXL ports. The regular copper PCI-Express transport would be the Ford F150 truck version of the memory intraconnect, sticking with the metaphor. The big difference – and one that has to be masked from software – is that CXL provides coherence in the hardware, while Gen-Z is an intraconnect that expects coherence to be provided somewhere in the software stack. Hence, we keep harping on the memory hypervisor to mask all of this complexity. If applications just see memory and something in the middle keeps it from being corrupted, then all will be well.
This is not as easy as it sounds, as Pappas, who fought the Bus Wars way back then under the Intel flag, reminded us.
“Pooling of storage like disks and flash is easier to do because the storage protocols are designed to hide latency,” Pappas explained. “They are bandwidth devices and I/O devices, and they hide latency by, for instance, getting multiple outstanding requests in process at the same time. With memory, it runs so fast that there is no place to hide the latency, and that is the real problem.”
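The latency-hiding point Pappas makes can be sketched with Little's Law: sustained throughput is the number of bytes in flight divided by the latency. The function name and every figure below (queue depths, latencies, the ten outstanding cache-line loads) are illustrative assumptions of ours, not measurements of any device or anything stated by Pappas.

```python
def sustained_throughput(outstanding, request_bytes, latency_s):
    """Little's Law: bytes per second = bytes in flight / latency."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return outstanding * request_bytes / latency_s


# Assumed, illustrative figures only.
# Storage: 4 KiB requests at 100 microseconds, queue depth 1 versus 32.
flash_qd1 = sustained_throughput(1, 4096, 100e-6)
flash_qd32 = sustained_throughput(32, 4096, 100e-6)

# Memory: 64-byte cache lines, with an assumed cap of 10 loads in flight.
# Local memory at 100 ns versus a hypothetical pooled memory at 300 ns.
local_mem = sustained_throughput(10, 64, 100e-9)
pooled_mem = sustained_throughput(10, 64, 300e-9)

for label, value in [
    ("flash, queue depth 1", flash_qd1),
    ("flash, queue depth 32", flash_qd32),
    ("local memory, 10 in flight", local_mem),
    ("pooled memory, 10 in flight", pooled_mem),
]:
    print(f"{label:<28} {value / 1e9:6.2f} GB/s")
```

Under these assumptions, a storage stack recovers throughput simply by queuing more requests, while a load path with a fixed number of requests in flight loses throughput in direct proportion to any latency added by the fabric.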
Even if we can’t get memory pooling this way for some time – call it three to five years before all the software catches up with the hardware and it is all transparent to operating systems and applications – it is fun to think about the possibilities and to see the industry get together to drive a common way to achieve it.
In the meantime, the hardware has to come before the software as it always does so the software engineers have hardware to exploit. So, now we need every CPU that has PCI-Express controllers to have CXL 1.1 ports; we need all compute engines – and we mean all compute engines – to have CXL 2.0 ports running atop PCI-Express 6.0 controllers in 2023 and all compute engines to have CXL 3.0 running atop PCI-Express 7.0 and whatever silicon photonics is actually commercialized at that time when they are in servers in 2025. (The PCI-Express specs will be finalized about two years before that at each release, of course.) That’s plenty of time to start doing interesting things now with PCI-Express 5.0, and amazing things later with the subsequent generations.
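The bandwidth doubling behind this roadmap can be put into rough numbers. The sketch below uses the per-lane transfer rates published by PCI-SIG for each generation and deliberately ignores encoding and protocol overhead, so the results are raw upper bounds per direction; the function and table names are ours.

```python
# Raw per-lane signaling rate in gigatransfers per second, by generation.
RAW_GT_PER_S = {"3.0": 8, "4.0": 16, "5.0": 32, "6.0": 64, "7.0": 128}


def raw_gbytes_per_s(generation, lanes=16):
    """Raw one-direction rate in GB/s, ignoring encoding and protocol overhead."""
    # One bit per transfer per lane, eight bits per byte.
    return RAW_GT_PER_S[generation] * lanes / 8


for generation in RAW_GT_PER_S:
    print(f"PCI-Express {generation} x16: {raw_gbytes_per_s(generation):.0f} GB/s raw")
```

Delivered CXL bandwidth will be lower than these raw figures once encoding and protocol overhead are accounted for.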