Rest In Pieces: Servers And CXL

If you had to rank the level of hype around specific datacenter technologies, the top thing these days would be, without question, generative AI, probably followed by AI training and inference of all kinds and mixed-precision computing in general. Co-packaged optics for both interconnects and I/O comes up a lot, too. But with so many compute engines pressed up against the memory wall, talk inevitably turns to increasing bandwidth and capacity for CPUs, GPUs, and FPGAs, and to lowering the overall cost of the main memory that gives the systems of the world a place to think.

And that leads straight to the CXL protocol, of course.

Rambus has a long history of innovating in the memory and I/O arenas, and it is a player in both HBM memory and CXL extended and shared memory for systems. And so we recently had a chat with Mark Orthodoxou, vice president of strategic marketing for datacenter products at Rambus, about the implications of CXL memory pooling and memory sharing for impending and future server designs.

Among the many things that we discussed with Orthodoxou was the fact that the PCI-Express switch makers – Broadcom’s PLX division and Microchip – are far too late coming to market with their products, lagging the speed of the PCI-Express slots in servers by 18 months to 24 months, depending on how you want to count it. If we are going to live in a CXL world that attaches CPUs to each other, main memory and expanded memory to servers, and CPUs to accelerators, we need switches to come out concurrently with the server slots they plug into, or there isn’t much of a point, is there? Grrrr.

And so, we have strongly encouraged Rambus to start making PCI-Express switches and get things all lined up. You can lean on them in the comments on this story, too.

11 Comments

  1. “…We need to make sure this is broadly useful” (close to a quote near the end of the video), and yeah, I was thinking back to my SAN days: instead of SunServerN getting permanent access to this “LUN”, it’s webserverN getting semi-permanent access to DIMM 172 in a “memory server”.

    But I have to ask: for the general case, is all this external-to-the-server memory infrastructure worth the cost? In my somewhat recent 10(ish) years of “server sitting” for a ~4K-server Xeon/Epyc-based financial research/training cluster, yes, the job scheduler took into account the available RAM on a given server when assigning jobs, and I am sure little bits of RAM may have been “wasted” (unused) for short periods of time, but all this CXL plumbing is not without cost. RAM sharing at the rack level? Local RAM plus access to a per-rack RAM-box for short-term extra needs? Hmmm, right up there with all the extra per-rack costs for non-single-phase immersion cooling. (A rough back-of-the-envelope sketch of the stranded-RAM tradeoff follows this comment.)

    Maybe I’m just trying to hold on to my data center go-to (2U / 4-node dual-socket servers) too long, and, as with LLMs, a Brave New World awaits.

    …Or maybe this is all just a “hardware hallucination” for all but the Cloudiest of providers.
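
    To put very rough numbers on that stranded-RAM question, here is a minimal sketch in C; the fleet size, DRAM capacity, utilization, and pool-efficiency figures below are made-up assumptions for illustration, not measurements from any real cluster:

    ```c
    /* stranded_ram.c -- toy estimate of how much idle DRAM a rack-level
     * CXL pool might put back to work. Every input is an illustrative
     * assumption, not measured fleet data.
     * Build: cc -o stranded_ram stranded_ram.c */
    #include <stdio.h>

    int main(void) {
        const int    servers            = 4000;   /* assumed fleet size                 */
        const double dram_per_server_gb = 1024.0; /* assumed local DRAM per server      */
        const double avg_utilization    = 0.60;   /* assumed average share of RAM used  */
        const double pool_efficiency    = 0.75;   /* assumed share of idle DRAM a pool
                                                     could realistically reclaim        */

        double idle_gb      = servers * dram_per_server_gb * (1.0 - avg_utilization);
        double reclaimed_gb = idle_gb * pool_efficiency;

        printf("Idle DRAM across the fleet:       %.0f GB\n", idle_gb);
        printf("Potentially reclaimed by pooling: %.0f GB (%.1f TB)\n",
               reclaimed_gb, reclaimed_gb / 1024.0);
        return 0;
    }
    ```

    Whether that reclaimed capacity actually pays for the extra CXL plumbing, switches, and latency is exactly the open question.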

  2. Surely an advantage of CXL memory is that it can be shared between {CPU, GPU, FPGA} devices that don’t share cache coherency, so it is best suited to use cases that can exploit lockless concurrency. (A minimal sketch of that pattern follows this thread.)

    • Is this the same partnership that brought us Rambus memory during the Pentium 4 era?

      I don’t see the attraction of more memory with less bandwidth and more latency when the ratio of processor speed to RAM speed is already so far out of balance.
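
    On the lockless-concurrency point above, here is a minimal single-producer/single-consumer ring buffer sketch in C11. It assumes the ring lives in a region both agents can simply load and store (say, a hypothetical CXL-backed mapping obtained elsewhere) and that the host presents that region coherently; cache flushing between genuinely non-coherent agents is deliberately not handled here.

    ```c
    /* spsc_ring.c -- minimal lock-free single-producer/single-consumer ring.
     * The ring is assumed to sit in memory both sides can access (e.g. a
     * hypothetical CXL-backed mapping set up elsewhere); coherence between
     * non-coherent agents is out of scope for this sketch.
     * Build: cc -std=c11 -o spsc_ring spsc_ring.c */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SLOTS 256u               /* keep it a power of two */

    struct spsc_ring {
        _Atomic uint32_t head;            /* written by the producer only */
        _Atomic uint32_t tail;            /* written by the consumer only */
        uint64_t         slots[RING_SLOTS];
    };

    /* Producer side: returns 0 on success, -1 if the ring is full. */
    static int ring_push(struct spsc_ring *r, uint64_t value) {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SLOTS)
            return -1;                    /* full */
        r->slots[head % RING_SLOTS] = value;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
    }

    /* Consumer side: returns 0 on success, -1 if the ring is empty. */
    static int ring_pop(struct spsc_ring *r, uint64_t *out) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (head == tail)
            return -1;                    /* empty */
        *out = r->slots[tail % RING_SLOTS];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 0;
    }

    int main(void) {
        static struct spsc_ring ring;     /* stand-in for a shared CXL mapping */
        uint64_t v;
        ring_push(&ring, 42);
        if (ring_pop(&ring, &v) == 0)
            printf("popped %llu\n", (unsigned long long)v);
        return 0;
    }
    ```

    If the fabric really does keep the two agents hardware-coherent (the CXL 3.x pitch), the atomics alone should carry the day; without that, each slot hand-off would also need explicit cache management.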

  3. Hopefully, we can get to CXL 3.0 on PCIe 6.0 soon (with Rambus switches), as this is where I would think CXL will become more useful and properly take off (sharing the data meal, organically, between multiple cephalopodic hosts; breaking bread, and cheese, amongst guests, with coherence). CXL 2.0 on PCIe 5.0 is okay for a single, centralized, cephalopodic memory server and a few guests, but 3.0 is where the full-tilt boogie hits the cafeteria and the whole disaggregated system gets to swing as a single gastronomic composition (in my perception) (yummy!).

    Still, yes, decoupling CXL from PCIe (discussed towards the end) could be the greatest thing since sliced bread (which nearly nobody eats here, in France). I suspect that CXL over USB could find some uses, and that CXL on opto-linked SerDes, or even CXL on the metal, could win some Top Chef-style awards for taste (almond paste playing the role of firmware, over cooked, flaky pastry dough for the hardware; the whole thing down the hatch for a lifetime of gustative memory goodness, to be shared, of course)!

    Rambus should definitely get its cognitive kitchens running to cook something up in this area! 8^b

    • With all this talk of “cephalopodic hosts” and “cephalopodic memory servers”, I wonder whether by 2030 we may be looking at an algorithmic “Cthulhu” manifesting itself in some data center. On a more serious note, I do see the metaphorical architecture that CXL embodies (sprinkled generously with the sublimations…err…additions of HP’s Gen-Z, IBM’s OpenCAPI, and Xilinx’s CCIX) leading to at least some kind of emergent algorithmic framework similar to a cephalopod’s neural makeup, where its tentacles are now found to be “somewhat intelligent” on their own and in constant communication with the host cephalopod’s brain. I’m in NO way implying that I believe some emergent intelligence will come about simply from this latest interconnect scheme of the future. However, could the very framework itself let data center designers and software engineers develop a quasi-independent node infrastructure, regardless of whether A.I. is used or not? That is to say, could nodes do their own thing independently of the main core, once the main core gives them the basic parameters of what it is and what it wants to accomplish? And as such, could the nodes go about collecting data and doing something with it that not only satisfies the demands of the core, but makes the nodes more proficient as well? In other, other words, a symbiotic relationship within the host, rather than a symbiosis between the host and some outside entity?

      Or has Cthulhu already taken over my brain? I suppose, to extend your gastronomic metaphor to the point of beating an expired equine… this is food for thought?

      • Cthulhu stew notwithstanding, I do smell (cognitively) that the cephalopodic inner-symbiosis you evoke is indeed the azimuth towards which the “large systems” cabal is ambulating at this trifurcation, with its contemporary focus on composable disaggregated machinery (a.k.a. modular podism). Symbiotic gastropodism might also work (esp. with butter and garlic), but the squid-like combination of smarts (CPU/brain) with tentacles (interconnect/SerDes) feels like the defining characteristic of this new breed of alienware.

        To me, it is mainly about practical concerns of modular maintenance and upgradeability, but I’ll readily yield to the much more mouthwateringly tempting prospect of a singular emergence of cosmic superintelligence, and hail the next coming of composable Cthulhu! 8^b

  4. Latency latency latency!
    Only a fool would use CXL for pooled resources if end-to-end latency is noticeably worse than local. Switched GPUs or NVMe drives, fine – people might not notice even a microsecond. But CXL people keep talking about RAM, where 10 ns is noticeable and 50 ns is a big deal.

    Maybe talk to lifeforms more advanced than marketing types…

    • I think it is that Pareto surface issue (mentioned in a recent TNP interview) for tradeoffs between latency, bandwidth, and total memory (size). CXL favors total memory over the other two (essentially), and HBM favors bandwidth. We don’t really have a solution that favors latency (minimizing it) at this time, it seems; achieving this will probably take deep meditative PhD efforts to uncover the required memory-access kung-fu choreography (and kids these days are endlessly distracted by incoming text messages and notifications, which suggests it will be a while…).
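
      To make the latency concern concrete, here is a toy blended-latency calculation; the local and CXL-attached latencies and the traffic split are illustrative assumptions, not vendor figures:

      ```c
      /* amat.c -- toy blended access latency for a local DRAM + CXL tier mix.
       * All three inputs are illustrative assumptions.
       * Build: cc -o amat amat.c */
      #include <stdio.h>

      int main(void) {
          const double local_ns  = 100.0; /* assumed loaded local DRAM latency */
          const double cxl_ns    = 250.0; /* assumed CXL-attached DRAM latency */
          const double cxl_share = 0.20;  /* assumed fraction of accesses that
                                             land on the CXL tier              */

          double blended = (1.0 - cxl_share) * local_ns + cxl_share * cxl_ns;

          printf("Blended access latency: %.1f ns (%.1f%% worse than local)\n",
                 blended, 100.0 * (blended - local_ns) / local_ns);
          return 0;
      }
      ```

      With those made-up numbers, a 20 percent slice of traffic landing on the slower tier already pushes the blend about 30 percent past local latency, which is why page placement and tiering software matter as much as the wire itself.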

  5. I keep wondering if the PCIe switch market was killed by too much greed.

    As Avago/Broadcom consolidated most of the independent PCIe switch vendors, they seem to have raised prices to the point where switches, which used to be relatively common even on consumer desktop boards, pretty much disappeared even from servers and became relegated to a storage niche that evidently didn’t support the vast scale you need to keep up.

    With NVMe there seems to be a big demand for affordable PCIe switching, and I find it especially telling, given how lopsided the PCIe switch market seems to be, that you can buy a Zen CPU with a multi-protocol (PCIe+SATA+USB+RAM+IF) switch in the form of an IOD for less than a lone switch chip with a similar port count.

    I’d really love to see a “state of the PCIe switch market” piece from you guys and gals (whither Nicole?).
