From one way of looking at it, a reprise of the Bus Wars of the late 1980s and early 1990s would have been a lot of fun. The fighting among vendors to create standards that they controlled ultimately resulted in the creation of the PCI-X and PCI-Express buses that have dominated in servers for two decades, as well as the offshoot InfiniBand interconnect, which was originally intended as a universal switched fabric to connect everything at high bandwidth and low latency. It perhaps took longer than it might otherwise have – it is hard to rewrite history.
But what we can tell you – and what we have discussed for the past several months – is that the warriors who lived through those Bus Wars have learned from the history they created, and they don’t have any stomach for a protracted battle over transports and protocols for providing memory coherence across hybrid compute engines. The reason is that we don’t have time for that nonsense, and with the economy very likely heading into recession globally, we have even less time for such shenanigans and ego.
This is why the same week we hosted The Next I/O Platform event in San Jose last September – a happy coincidence, not planning – all of the members of the key coherency efforts – the Compute Express Link (CXL) from Intel, the Coherent Accelerator Processor Interface (CAPI) and OpenCAPI superset from IBM, the Cache Coherent Interconnect for Accelerators (CCIX) from Xilinx, the Infinity Fabric from AMD, the NVLink interconnect from Nvidia, and the Gen-Z interconnect from Hewlett Packard Enterprise, backed heavily and early by Dell – got together to back Intel’s CXL protocol interconnect, itself a superset of PCI-Express 5.0, for linking processors to accelerators and sharing their memories. We did a very deep dive on CXL here at the same time, which was again a coincidence. Call it a harmonic convergence.
CXL and its coherent memory interconnect were designed to link processors to their accelerators and memory class storage within a system, and Gen-Z was primarily designed as a memory fabric that could have lots of different compute engines hanging off it, sharing great gobs of memory of various kinds. There was no rule that said CXL could not be extended out beyond a server’s metal skin to provide coherent memory access across multiple server nodes in one, two, or maybe three racks (as PCI-Express switching interconnects like those from GigaIO are doing, for instance). Similarly there was no rule that said Gen-Z could not be used as a protocol within a server node. Well, the rules of economics, perhaps, suggest that CXL will be cheaper than Gen-Z, mostly because of the silicon photonics that is involved with long-haul coherence. (We drilled down into the Gen-Z switching chippery that HPE has cooked up back in September 2019 as well, and have also reviewed the Gen-Z memory server that the Gen-Z consortium is prototyping to create pooled main memory.)
But, under a memorandum of understanding signed by Kurtis Bowman, president of the Gen-Z consortium and also director of server architecture and technologies at Dell, and Jim Pappas, chairman of the CXL consortium and also director of technology initiatives at Intel, these two potentially warring camps have buried the hatchets. Or, more precisely, they didn’t even buy the hatchets to later bury them. They have decided to not have any hatchets at all and to work so that CXL and Gen-Z can interoperate and interconnect where appropriate. To be specific, they have put together a joint working group to hammer out the differences and try to keep these technologies on a coherent path. (Which seems appropriate, if you think about it.)
“You have probably seen that image of two train tracks coming from two different places trying to link, and they miss,” Pappas tells The Next Platform. “This working group is about making sure that the train tracks come together and they actually attach.”
Both Bowman and Pappas have talked to us within the last few months in the stories mentioned above about using CXL as a potential mount point for Gen-Z fabrics inside of servers, and we suspect that this is how they will carve it up until long after Gen-Z silicon is out in volume. The higher bandwidth silicon photonics of the HPE Gen-Z chipsets will be the Cadillac version of linking servers out to memories (either chunks of raw, aggregated memory or the memory inside of accelerators or coprocessors) and the CXL ports will be the Chevy version, with shorter range and lower bandwidth.
“Both protocols have memory in their DNA,” explains Pappas. “With CXL, you have got limited reach because it is basically PCI-Express. But with Gen-Z you have much greater distances – rack, row, and even datacenter someday. But at the end of the day, you have a coherent interface into the CPU, which CXL is, and then Gen-Z lets you take it all out to a wider distance and make it work.”
There is no talk of trying to merge the transports or protocols, but we would not be surprised in the long term if CXL just ends up being a protocol running on a variety of silicon photonics transports, including but not limited to Gen-Z. At some point, PCI-Express will not be the system bus, so CXL cannot be tied indefinitely to that bus. But, this is many, many years from now.
There are some differences in the way CXL and Gen-Z work that will require coordination so they can speak to each other properly.
“CXL is a true hardware coherent design and application software does not have to be aware of it at all,” explains Bowman. “Gen-Z does not have built in hardware coherency and so it is looking for a software coherency model and it uses things like atomics to enable that. If your applications want to go across a large memory space and share specific regions, the software will have to be written to take advantage of that. The reason we can’t do Gen-Z coherency in hardware is that the snoop cycles between the machines hooked to the memory would consume most of the fabric bandwidth. You can get the best of both worlds. If you need a hardware coherent interconnect, CXL is the way to go, and if you need a fabric to share resources within a rack or across rows, then Gen-Z is the way to go. But that coherency will be done in software.”
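To make the distinction Bowman draws a little more concrete, here is a toy sketch of what software-managed sharing looks like from the application's side. It is not Gen-Z code and says nothing about how Gen-Z atomics actually behave: the `SharedRegion` class, the `compare_and_swap` method, and the `increment_counter` helper are all hypothetical names, and a Python lock stands in for whatever atomicity a fabric would supply. The point is only that the application has to claim a region explicitly before touching it, because no hardware is snooping on its behalf.

```python
import threading
import time

FREE = 0  # owner value meaning "nobody holds the region"; node IDs must be nonzero


class SharedRegion:
    """Toy stand-in for a region of fabric-attached memory (hypothetical)."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.owner = FREE
        # The lock only models the atomicity a fabric atomic would provide.
        self._atomic = threading.Lock()

    def compare_and_swap(self, expected, new):
        """Set owner to `new` only if it currently equals `expected`."""
        with self._atomic:
            if self.owner != expected:
                return False
            self.owner = new
            return True


def increment_counter(region, node_id):
    """Claim the region, bump an 8-byte counter at offset 0, then release it."""
    while not region.compare_and_swap(FREE, node_id):
        time.sleep(0)  # yield and retry; real code would back off
    try:
        value = int.from_bytes(region.data[0:8], "little")
        region.data[0:8] = (value + 1).to_bytes(8, "little")
    finally:
        region.compare_and_swap(node_id, FREE)
```

Every writer pays for the claim and release on each update, which is the software cost Bowman is describing; with hardware coherence, that bookkeeping would be invisible to the application.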
The question we have is who is going to be blazing the CXL and Gen-Z trails. One of the answers is, of course, Microsoft and its vast Azure public cloud. Leendert van Doorn, distinguished engineer for the Azure cloud at Microsoft, is a board member of both the CXL and Gen-Z consortiums and he shared some of his thoughts on how this will work.
“I look at CXL as a local, in-node interconnect,” says van Doorn. “And once you start looking at disaggregation – specifically at a rack level – you need to go to something else because PCI-Express 5.0 doesn’t extend that far out. Yes, you can build retimers and there are people who have built fabrics based on PCI-Express in the past, but they have run into some challenges. These are challenges that Gen-Z has solved and addressed, and what we are talking about now is really the interface between CXL and Gen-Z. Now, will we see disaggregation at a smaller scale within a node with CXL? Yes, absolutely. We expect to see that. But the real disaggregation benefits come into play when you can scale it up against a rack or hopefully across an entire row of machinery. And that’s when you need to look at a different kind of fabric technology.”
So how might Microsoft make use of CXL and Gen-Z in its infrastructure? Van Doorn gave some hints, and they are consistent with what we have been thinking. We have rack-based flash modules sharing data across NVM-Express to a rack of servers and we can have rack-based main memory (persistent or dynamic, we don’t care) sharing data across Gen-Z, for instance.
“About 50 percent of my server cost is memory,” says van Doorn. “And so if I could more efficiently use that memory, sharing it across systems, there is a huge benefit there. So you could envision systems that have a certain amount of memory on the node itself and then another chunk of memory that is shared across the servers and is pulled from a pool depending on demand.”
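The arithmetic behind that idea is easy to sketch. The example below compares two ways of provisioning memory for a small group of servers: giving each one enough for its own peak, or giving each a fixed local allotment and covering the overflow from a shared pool. Everything in it is an illustrative assumption, including the function names, the demand figures, and the 128 GB local allotment; none of it comes from Microsoft.

```python
def dedicated_capacity_gb(demand):
    """Total memory if every server is sized for its own peak demand.

    `demand` is a list of time steps, each a list of per-server demand in GB.
    """
    servers = range(len(demand[0]))
    return sum(max(step[s] for step in demand) for s in servers)


def pooled_capacity_gb(demand, local_gb):
    """Total memory with `local_gb` per server plus one shared pool.

    The pool is sized for the worst time step's combined overflow beyond
    what each server can hold locally.
    """
    pool = max(sum(max(0, d - local_gb) for d in step) for step in demand)
    return local_gb * len(demand[0]) + pool


# Hypothetical workload: four servers that each spike to 300 GB, but never together.
demand = [
    [300, 100, 100, 100],
    [100, 300, 100, 100],
    [100, 100, 300, 100],
    [100, 100, 100, 300],
]

print(dedicated_capacity_gb(demand))      # 1200
print(pooled_capacity_gb(demand, 128))    # 684
```

In this made-up case the pooled design needs 684 GB instead of 1,200 GB. The saving depends entirely on the peaks not lining up; if every server spiked at once, the pool would have to be as large as the memory it replaced.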
It will all come down to latencies. DRAM memory accesses are around 80 nanoseconds inside of a node, and with a fast NUMA interconnect the typical access for far memory is around 135 nanoseconds, says van Doorn. And while the effects of that can be masked by the operating system kernel to a certain extent, they are noticeable. And that is why pooled main memory outside of the rack, even with Gen-Z, will be more challenging because of the additional latencies it will add – on the order of hundreds of nanoseconds to milliseconds, depending on the range. But with other persistent media, which is inherently slower than main memory, because of the added device latency of that media, it will be easier to use a protocol like Gen-Z to span many racks, as well as rows and even further because the Gen-Z fabric interconnect will be so much faster than the devices. And the Gen-Z latencies are so much lower than for Ethernet or InfiniBand that this should be the fabric of choice – in the long run – for row and datacenter scale storage interconnects. That’s the theory anyway.
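A quick way to see why those numbers matter is to blend them into an average access latency for a given mix of local and remote accesses. The sketch below uses the 80 nanosecond and 135 nanosecond figures cited by van Doorn; the 300 nanoseconds of added fabric latency and the 80/20 access split are purely hypothetical placeholders, as is the `average_latency_ns` helper itself.

```python
import math


def average_latency_ns(mix):
    """Weighted average latency for a list of (fraction, latency_ns) pairs."""
    total = sum(fraction for fraction, _ in mix)
    if not math.isclose(total, 1.0):
        raise ValueError("access fractions must sum to 1")
    return sum(fraction * latency for fraction, latency in mix)


LOCAL_NS = 80        # in-node DRAM access, per van Doorn
FAR_NUMA_NS = 135    # far memory over a fast NUMA interconnect, per van Doorn
FABRIC_ADDED_NS = 300  # hypothetical extra latency for pooled memory on a fabric

# 80 percent of accesses stay local in both cases (an assumed split).
numa_mix = [(0.8, LOCAL_NS), (0.2, FAR_NUMA_NS)]
pooled_mix = [(0.8, LOCAL_NS), (0.2, LOCAL_NS + FABRIC_ADDED_NS)]

print(round(average_latency_ns(numa_mix), 1))    # about 91 ns
print(round(average_latency_ns(pooled_mix), 1))  # about 140 ns
```

Under these assumptions, sending a fifth of accesses to the pool pushes the average from roughly 91 nanoseconds to roughly 140 nanoseconds. Swap in an added latency in the microsecond range and the average is dominated by the pool, which is the reason slower persistent media is the easier fit for longer reach.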
In the end, the interplay of latency and fungibility will play out, and we think that once these protocols are hardened and devices are out there, people will try to put as little memory and storage in a server node as they can get away with and create composable pools to extend it. This will make the best economic and technical sense.
The other question is when will CXL and Gen-Z just be normal parts of the datacenter infrastructure? Gen-Z development kits are available for consortium members now and Bowman says early adopters will be coming online in 2022 and it will be mainstream by 2023 to 2024. For CXL, the silicon is imminent, according to Pappas, and it will be going through debugging throughout 2020 and it is usually about a year after that before it is put into systems.