Just How Bad Is CXL Memory Latency?

Conventional wisdom says that trying to attach system memory to the PCI-Express bus is a bad idea if you care at all about latency. The further the memory is from the CPU, the higher the latency gets, which is why memory DIMMs are usually crammed as close to the socket as possible.

Logically speaking, PCI-Express is miles away. And as PCI-Express bandwidth doubles with each subsequent generation, the distance a signal can travel diminishes as well, unless retimers, which add their own latency, are pressed into service. This isn’t a big deal for most kinds of memory we are used to attaching to PCI-Express. It is not uncommon for flash storage to have latencies measured in the tens of microseconds, making a few hundred extra nanoseconds incurred by the interconnect a moot point. However, DDR and other forms of volatile memory aren’t so forgiving.

Previous attempts at memory expansion have been mired in compromise, especially with respect to latency. For instance, GigaIO says its FabreX architecture can already do memory pooling across PCI-Express using DMA, but doing so requires applications that can tolerate latencies of 500 nanoseconds to 1.5 microseconds.

Similarly, before Intel unceremoniously axed its Optane persistent memory business this summer, deploying the tech meant incurring roughly 350 nanoseconds of latency, according to our sister site Blocks and Files. While usable, especially in a tiered-memory configuration, that’s considerably more than the sub-100 nanosecond roundtrip latency you’d expect from DDR memory that’s attached directly to the CPU.

Enter The CXL Memory Ecosystem

This brings us to the first generation of memory expansion modules using the Compute Express Link protocol, or CXL. Systems based on AMD’s Epyc 9004 “Genoa” processors are among the first, boasting 64 lanes of CXL connectivity (distinct from their 128 to 160 overall PCI-Express lanes) that can be divided up among anywhere from four to sixteen devices. As for how Intel will implement CXL on its “Sapphire Rapids” Xeon SP processors, we’ll have to wait until they arrive early next year.

Complementing these servers are the first of what we are sure will be many CXL memory expansion modules. While it is true that CXL will eventually allow for fully disaggregated systems where resources can be shared throughout the rack over a high-speed fabric, those days are still a few years off.

For its first foray into the datacenter, CXL is squarely focused on memory expansion, tiered memory, and some early memory pooling applications. For the moment, we are just looking at memory expansion because at this early stage it’s arguably the simplest and most practical, especially when it comes to attaching memory at usable latencies.

Samsung and Astera Labs have already shown off CXL memory modules they say can add terabytes of memory to a system simply by slotting them into a compatible PCI-Express 5.0 slot. From a system perspective, they look and behave just like regular DDR DRAM memory that is attached to an adjacent socket over the memory bus.
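If you are curious how such a device enumerates, the place to look on Linux is the NUMA topology: a CXL memory expander generally appears as a node that has capacity but no CPUs attached. Below is a minimal sketch of that check, assuming a Linux host with libnuma installed; the node numbering will of course vary by platform.

```c
// Sketch: list NUMA nodes and flag CPU-less ones, which is how a CXL
// memory expander typically presents itself to Linux.
// Assumes libnuma is installed; build with: gcc cxl_nodes.c -o cxl_nodes -lnuma
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    struct bitmask *cpus = numa_allocate_cpumask();
    for (int node = 0; node <= numa_max_node(); node++) {
        long long free_bytes;
        long long size = numa_node_size64(node, &free_bytes);
        if (size < 0)
            continue;  // node not configured
        numa_node_to_cpus(node, cpus);
        unsigned ncpus = numa_bitmask_weight(cpus);
        printf("node %d: %lld MB, %u CPUs%s\n",
               node, size >> 20, ncpus,
               ncpus == 0 ? "  <-- CPU-less, possibly a CXL expander" : "");
    }
    numa_free_cpumask(cpus);
    return 0;
}
```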

For the longest time, once you reached the limits of the CPU’s memory controller, the only way to add more memory was to add more sockets. If the workload could take advantage of the extra threads, all the better, but if not, it was an awfully expensive way to add memory. In effect, the extra socket is just a memory controller with a bunch of expensive, unwanted cores attached to it.

Memory expansion modules behave in much the same way, but rather than a proprietary socket-to-socket interconnect, like Intel’s UPI or AMD’s xGMI, the link is CXL. And this means you can have a whole ecosystem of these devices; in fact, we’re already seeing a rather vibrant, if at times aspirational, one take hold around CXL.

CXL actually encompasses three protocols, and not all of them are silver bullets for latency, CXL Consortium president Siamak Tavallaei told The Next Platform at SC22. “CXL.io still has the same kind of latency as you expect (from PCI-Express), but the other two protocols — CXL.cache and CXL.mem — take a faster path through the protocol, and they reduce the latency.”

How Bad Is The CXL Memory Latency Really?

If the folks at Astera are to be believed, the latency isn’t as bad as you might think. The company’s Leo CXL memory controllers are designed to accept standard DDR5 memory DIMMs up to 5600 MT/sec. They claim customers can expect latencies roughly on par with accessing memory on a second CPU, one NUMA hop away. This puts it in the neighborhood of 170 nanoseconds to 250 nanoseconds. In fact, as far as the system is concerned, that’s exactly how these memory modules show up to the operating system.

Most CXL memory controllers add about 200 nanoseconds of latency, give or take a few tens of nanoseconds for additional retimers depending on how far away the device is from the CPU, Tavallaei explains. This is right in line with what other early CXL adopters are seeing as well. GigaIO chief executive officer Alan Benjamin tells The Next Platform that most of the CXL memory expansion modules it has seen are closer to 250 nanoseconds of latency than 170 nanoseconds.
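For anyone who wants to put a number on this themselves, the usual tool is a pointer-chasing microbenchmark: a chain of dependent loads through a randomly shuffled buffer defeats the hardware prefetchers and exposes the load-to-use latency of whichever node the buffer lives on. Here is a rough sketch along those lines, again assuming Linux and libnuma; pass the NUMA node to probe on the command line and compare a local DDR node against the CXL-attached one.

```c
// Sketch of a pointer-chasing latency probe for a given NUMA node.
// Assumes Linux with libnuma; build with: gcc -O2 chase.c -o chase -lnuma
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (64UL * 1024 * 1024 / sizeof(void *))  /* 64 MB working set */
#define HOPS    (10L * 1000 * 1000)                    /* dependent loads */

static double chase_ns(int node) {
    void **buf = numa_alloc_onnode(ENTRIES * sizeof(void *), node);
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }

    /* Link the buffer into one random cycle so prefetchers cannot
       hide the true load-to-use latency. */
    size_t *order = malloc(ENTRIES * sizeof(size_t));
    for (size_t i = 0; i < ENTRIES; i++) order[i] = i;
    for (size_t i = ENTRIES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < ENTRIES; i++)
        buf[order[i]] = &buf[order[(i + 1) % ENTRIES]];
    free(order);

    struct timespec t0, t1;
    void **p = &buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < HOPS; i++)
        p = (void **)*p;                 /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (p == NULL) puts("unreachable"); /* keep the chase from being optimized out */
    numa_free(buf, ENTRIES * sizeof(void *));
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / HOPS;
}

int main(int argc, char **argv) {
    if (numa_available() < 0) return 1;
    int node = argc > 1 ? atoi(argv[1]) : 0;
    printf("node %d: %.1f ns per dependent load\n", node, chase_ns(node));
    return 0;
}
```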

However, as Tavallaei points out, this is still an improvement over four-socket or eight-socket systems where applications may have to contend with multiple NUMA hops just because they need the memory. (Although, to be fair, IBM and Intel have added more and faster links between CPUs to reduce the hops and the latencies per hop.)
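The relative cost of those hops is something the platform already advertises through the ACPI SLIT distance table, which Linux exposes and libnuma can read back: local memory is reported as 10, a remote socket typically as 20 or so, and CPU-less expander nodes are often assigned a larger value still. A quick sketch to dump the matrix, under the same Linux-plus-libnuma assumption as above:

```c
// Sketch: print the NUMA distance matrix (ACPI SLIT) via libnuma.
// Build with: gcc distances.c -o distances -lnuma
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) return 1;
    int max = numa_max_node();
    printf("        ");
    for (int j = 0; j <= max; j++) printf("node%-3d", j);
    printf("\n");
    for (int i = 0; i <= max; i++) {
        printf("node%-3d ", i);
        for (int j = 0; j <= max; j++)
            printf("%-7d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}
```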

With that said, many chipmakers are quick to point out that the CXL ecosystem is only now getting off the ground. AMD’s Kurtis Bowman, who serves on the CXL Consortium board of directors, tells The Next Platform that many of the early CXL proofs of concept and products use FPGAs or first-gen ASICs that haven’t yet been optimized for latency. With time, he expects latencies to improve considerably.

If CXL vendors can, as they claim, achieve latencies on par with multi-socket systems outside of show-floor demos, it should largely eliminate the need for application-specific or operating system-specific customizations to take advantage of them. Well, at least as far as memory expansion is concerned. As we’ve seen with Optane, CXL memory tiering will almost certainly require some kind of operating system or application support.
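What that support looks like at the application level can be fairly mundane: the program, or a runtime or kernel tiering daemon underneath it, decides which data is hot enough to deserve local DDR and which can live on the far node. A minimal sketch of the application-side version, again with libnuma, where node 2 is a hypothetical stand-in for wherever the CXL expander happens to enumerate:

```c
// Sketch: steer a cold buffer onto a far (e.g. CXL-attached) NUMA node
// while hot data stays on the local DDR node. Node 2 is hypothetical.
// Build with: gcc tier.c -o tier -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define COLD_BYTES (1UL << 30)  /* 1 GB of rarely touched data */

int main(void) {
    if (numa_available() < 0) return 1;

    int cxl_node = 2;  /* hypothetical: the CPU-less node found earlier */

    /* Hot data: default policy, lands on the local DDR node. */
    char *hot = numa_alloc_local(4096);

    /* Cold data: bound to the far node so it never competes for local DDR. */
    char *cold = numa_alloc_onnode(COLD_BYTES, cxl_node);
    if (!hot || !cold) { perror("numa_alloc"); return 1; }

    memset(cold, 0, COLD_BYTES);  /* touch the pages so they fault in on the chosen node */
    printf("cold tier placed on node %d\n", cxl_node);

    numa_free(cold, COLD_BYTES);
    numa_free(hot, 4096);
    return 0;
}
```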

This couldn’t come at a better time as sockets grow larger and fitting more DIMMs on a board is getting harder and harder. There are just fewer places to put them. There are dual-socket systems with room for 32 DIMMs, but as chipmakers add more channels to satiate the bandwidth demands of ever higher core counts, this isn’t scalable.

We are already seeing this to some degree with AMD’s Genoa chips, which, despite boosting the number of memory channels to twelve, only support one DIMM per channel at launch, limiting the number of DIMMs in a dual-socket configuration to 24. And even if you could attach two DIMMs per channel, we are told fitting 48 DIMMs into a standard chassis would be impractical.

As we look to attaching memory at longer distances, across racks for instance, things get more complicated as latency accrued from electrical or optical interconnects must be factored into the equation. But for in-chassis CXL memory expansion, it appears that latency may not be as big a headache as many had feared.


6 Comments

  1. OK, so latency is only higher by a single-digit multiple. Now, if you’re incredibly latency tolerant and embrace the disaggregated concept, more than a few CPUs and GPUs will be contending for access to the aggregated memory. How will latency react to contention? Is it hard to build a CXL switching complex that can sustain tens of GB/sec from numerous hosts at the same time?

  2. Looks like on-chip HBM + off-chip CXL memory pools + a DSA or DPU could be orchestrated to prefetch the anticipated memory blocks, perhaps under direct user control, or perhaps using some AI, analogous to branch predictors.

    • We might need local DDR until PCI-Express 6 or 7. But in a funny way, I think flash has to be the new tape, and we need to figure out a way to add a lot more memory to the CPUs. A lot more.

      • Memory performance is gated primarily by memory device design, organization, and access, not by the interconnect used to reach it. Increasing interconnect speed provides modest incremental improvement, which is why processor vendors have focused on increasing the number of DDR memory channels at great cost and increased system complexity. That cost and complexity were the primary reasons behind using serial interconnects to augment, and perhaps one day replace, DDR. Unlike DDR, serial interconnects rely on a media controller to abstract the memory: the underlying device type, number, access method, and so on. In essence, a media controller splits off the lower half of a memory controller, which allows memory devices, functionality (caching, accelerators, etc.), mechanical module packaging, and more to be optimized in ways that processor memory controllers could never support. As such, media controller innovation will ultimately drive system performance, which will enable processor simplification as large memory channel pin counts and memory controller complexity are eliminated along with the accompanying power consumption. Ultimately, the interconnect itself won’t matter that much, as all data access is once again simplified to basic reads and writes.
