Generative AI is arguably the most complex application that humankind has ever created, and the math behind it is staggeringly intricate even if the results are simple enough to understand. GenAI also has some serious bottlenecks when it comes to memory bandwidth and memory capacity, and these bottlenecks could be the driver for the adoption of memory godboxes that a number of different companies have been trying to bring to market over the past several years.
Generally, these memory servers use the CXL protocol to extend the main memory of systems, pooling many terabytes of DDR main memory so it can act as a relatively fast and fat cache for the wickedly high bandwidth but relatively low capacity HBM stacked memory that is commonly wrapped around GPUs and other kinds of XPU accelerators for AI workloads.
Enfabrica, with its new Emfasys memory cluster, is the latest to deliver a memory godbox in production, and KV caches for speeding up AI inference for ever-more-complex queries could turn out to be the killer application that Enfabrica and its peers have been waiting for.
Enfabrica was founded in early 2020 by Rochan Sankar, its chief executive officer, and Shrijeet Mukherjee, its chief development officer. Among other things, Sankar was the director of product marketing and management at Broadcom for five generations of its “Trident” and “Tomahawk” datacenter switching ASICs, and Mukherjee was director of engineering on the “California” Unified Computing System converged server-network system and then vice president of software engineering at Cumulus Networks.
The company dropped out of stealth mode back in June 2021, saying vaguely that it would be taking on the “$10 billion I/O problem” in distributed systems, converging and simplifying many of the disparate interconnects within a server node and across them in a cluster. Nearly two years later, Enfabrica raised the curtain a little bit on its Accelerated Compute Fabric, which sought to converge thousands of in-rack Serdes and dozens of distinct networking ASICs down to a memory and I/O pool that could deliver terabytes of DRAM memory with latencies on the order of hundreds of nanoseconds to accelerate all kinds of workloads.
It looks like the KV cache at the heart of AI inference will be the killer app that the Emfasys memory godbox is aimed at.
If HBM memory could be stacked up with dozens and dozens of DRAM chips and more stacks could affordably be wrapped around GPUs and XPUs, there would not be a need for a memory godbox at all to help with AI inference.
But HBM memory is getting more expensive faster than its performance and capacity is rising, and GPU and XPU makers are not inclined to make a compute engine with balanced memory and compute. Rather, they put the bare minimum of HBM they can on each device – which means a little more than the one they put out last year – and have customers scale out the compute in order to scale out the memory. Even those hyperscalers and cloud builders making their own XPUs for AI training and inference do the same thing, and that is because compute is cheap and easy and memory is expensive and complex.
In those cases where a workload is compute bound, these compute engines with not enough HBM memory sing. But for AI inference, they could use more memory, particularly as context windows for prompt inputs are growing.
Before we get into the Emfasys memory godbox, we need to talk a little bit about how AI inference makes use of HBM memory in GPUs and XPUs and is bottlenecking performance.
AI Inference Needs More Long Term Memory
Large language models chop up information into tokens and pass it forwards and backwards through a neural network emulated in software to create parameters. These parameters in a neural network are the weightings on connections between the virtual neurons, just like the voltage spikes in our actual neurons that help us think and imagine. We started AI out by using labeled datasets, but at a certain scale we learned that if we fed a model enough tokens – trillions of snippets of information – we could teach a neural network to chop any piece of data, like a picture or a block of text, into descriptive bits and then reassemble them in a way that it “knows” what that image or text is.
So, the more parameters you have, the richer the spiking dance on the neural network is. Tokens tell you how much you know, and parameters tell you how well you can think about what you know. Smaller parameter counts against a larger set of tokens gives you quicker, but simpler, answers. Larger parameter counts against a smaller set of tokens gives you very good answers about a limited number of things.
When training or inferring, the activated parameters have to be stored in the HBM memory in the compute engines running the neural network. For training, you need tens of thousands of compute engines to process trillions of tokens and create parameters that embody the model within a reasonable amount of time. If you need to train a model quicker, you have to throw more compute (and therefore more memory) at it.
With an LLM from a few years ago, you dump your prompt into the LLM’s API and that comprises a context window for the prompt. Many models have tens of thousands of tokens for their context windows (particularly mixture of experts models that are really a collection of models and a router that knows how to activate only necessary parts of the models to answer a prompt), but there are big models that have context windows measuring in the millions of tokens.
With inference, which is where the AI rubber will hit the enterprise road, there is an attention mechanism – like the one you are using inside of your head right now as you read – that remembers what is important about the query (this is the key) and what is important about the context (this is the value) to answer the prompt. The problem is that with each new token that is processed, the attention key and value vectors have to be calculated for all of the prior tokens that have been processed to update the attention weightings.
Thus, LLMs have been augmented with something called a KV cache, which can speed up token processing and generation by orders of magnitude by storing the prior attention KV vectors in memory so they do not have to be recomputed each time as the model runs its attention algorithms on each successive token.
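To make that mechanism concrete, here is a toy sketch of a KV cache in action, assuming a single attention head with an illustrative 64-wide head dimension. Each new token appends its key and value vectors to the cache once; attention for that token then reads the cached vectors instead of recomputing them from scratch.

```python
import numpy as np

def attend(q, k_cache, v_cache):
    # Scaled dot-product attention of one new query against all cached keys.
    scores = q @ k_cache.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

d = 64                      # head dimension (illustrative only)
k_cache = np.empty((0, d))  # grows by one row per token processed
v_cache = np.empty((0, d))

rng = np.random.default_rng(0)
for _ in range(8):          # eight tokens of a prompt, stand-ins for real embeddings
    k, v, q = rng.standard_normal((3, d))
    # Append this token's key/value once; prior rows are never recomputed.
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])
    out = attend(q, k_cache, v_cache)

print(k_cache.shape)  # (8, 64): one cached K vector per token seen so far
```

Without the cache, step N would recompute all N sets of key and value vectors, which is where the orders-of-magnitude speedup on long sequences comes from.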
These KV caches can be held in GPU and XPU memory, but with low hundreds of gigabytes of capacity per compute engine, they can fill up pretty fast if the context window is large. And remember, model weights and activation memory also have to be stored in that HBM memory as inference is running. By the way, the KV cache is used for both processing the context window and processing the output answer to a prompt, and it has to store all the KV vectors for both input and output.
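The back-of-envelope math shows why those low hundreds of gigabytes fill up so fast. The parameter values below are illustrative for a hypothetical dense model, not any specific LLM: the KV cache stores two vectors per token, per layer, per KV head.

```python
# Hypothetical dense model dimensions -- illustrative, not any specific LLM.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2            # FP16/BF16 precision
tokens = 128_000              # a large context window

# Two vectors (K and V) per token, per layer, per KV head.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total_gb = tokens * bytes_per_token / 2**30
print(f"{bytes_per_token} bytes/token, {total_gb:.0f} GB for {tokens:,} tokens")
```

At these assumed dimensions, a single long-context request chews through tens of gigabytes of HBM before the model weights and activations even enter the picture.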
As we have been fond of pointing out, caching embeddings and parameters is one of the reasons why Nvidia was so eager to create its own Arm-based server CPU and link it to a GPU accelerator with HBM memory using NVLink ports and load/store NVLink memory sharing. The Grace processor has just under a half terabyte of LPDDR5 memory, but this is a hell of a lot more capacity than the 80 GB and 96 GB of the original “Hopper” H100 GPUs. And we strongly suspect that the future “Vera” CPUs coming next year will have a lot more memory capacity than Grace does because GPUs need direct access to host server memory precisely because KV caches (and embeddings used for recommendation engines) are going to keep growing.
So, there is the context window for you to think about the Emfasys hardware we are now going to show you. Memory is a bottleneck, and having a giant pool of shared memory to hold KV vectors and embeddings that operates at host main memory speeds is one way to make inference run faster and cheaper.
CXL Makes Good At Last
We did a very deep dive into the ACF architecture and its place in the interconnects of the datacenter back in March 2023, and we are not going to redo that in its entirety here. But a recap is perhaps in order.
The ACF-S device, which Enfabrica now calls a SuperNIC as well and which is known by the codename “Millennium,” is really a converged Ethernet and PCI-Express/CXL switch chip. (Why not a SuperSwitch? Because SuperMicro already uses that brand for its plain vanilla Ethernet switches based on Broadcom ASICs.) By converging these three tiers together, you can get rid of a top of rack Ethernet switch, a bank of network interface cards and host bus adapters, and a bunch of PCI-Express switches that are typically installed in compute racks. Here is the block diagram of the ACF-S SuperNIC:
At the time, we did not know the precise feeds and speeds of the ACF SuperNIC, but as you can see plainly, you could take a bunch of CPUs and GPUs with PCI-Express ports and create a pretty cool rackscale system. The bandwidth on the ports would be nothing like an NVLink Switch fabric connecting Nvidia SXM GPU sockets, of course, but there is a level of simplicity here, balanced against the risk of buying a new and unknown thingamabob.
Which is why the Emfasys memory godbox has been created as a standalone memory accelerator and extender for AI inference workloads. This is just one possible use of the ACF-S chippery, and we presume there will be others. Perhaps even with variations that include NVLink Fusion I/O chiplets that allow for the higher-end Nvidia GPU accelerators to be directly linked to the SuperNIC.
Here is what the Emfasys memory system looks like:
And remember this is just one use case with a special configuration designed specifically to augment the memory capacity of GPUs and XPUs in a system.
The Emfasys memory servers are in the center rack, and there are eight of them in the rack. Each memory server has nine of the SuperNICs installed, which present two CXL memory DIMMs per channel and which deliver 18 TB of DDR5 main memory capacity using a pair of 1 TB DIMMs on each of those nine SuperNICs. (The feature image at the top of this story shows the Emfasys memory server with the cover off.)
Eight of these memory servers gives you 144 TB of memory smack dab in the middle of four GPU server racks. There are a pair of Emfasys initiators in each GPU server rack, which use PCI-Express MCIO links to hook down into the GPU servers and then 800 Gb/sec Ethernet ports using RoCE RDMA for low latency to hook out to the memory servers in the center of the row configuration.
We fully realize that 1 TB DDR5 memory sticks are very expensive, and you might be tempted to go with skinnier memory to save money. But even with 256 GB DDR5 memory with CXL hooks, you could get 4.5 TB per box and 36 TB total capacity in that memory rack. And at 192 GB per GPU, as is common these days, it would take 192 GPUs to provide that much memory capacity, and Nvidia “Blackwell” B200 GPUs probably cost on the order of $40,000 a pop. Wrapping systems around them and then under-utilizing them just for memory capacity seems silly. And in a GB200 NVL72 system, each B200 GPU only has access to its share of LPDDR5 memory on the Grace processor that is split between two Blackwell GPUs. It would take 144 Grace CPUs to add up to 36 TB of KV cache memory, and it would be chopped up into 288 pieces.
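For those keeping score at home, here is the capacity arithmetic above in a few lines, using the configuration numbers from this story (nine SuperNICs per server, two DIMMs per SuperNIC, eight servers per rack):

```python
# Capacity math for the Emfasys rack, per the configuration described above.
supernics, dimms_per_nic, servers = 9, 2, 8

tb_fat = supernics * dimms_per_nic * 1.0     # 1 TB DIMMs
print(tb_fat, tb_fat * servers)              # 18.0 TB per box, 144.0 TB per rack

tb_skinny = supernics * dimms_per_nic * 0.25 # 256 GB DIMMs
print(tb_skinny, tb_skinny * servers)        # 4.5 TB per box, 36.0 TB per rack
```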
The Emfasys memory godbox – it is really a memory cluster – can stripe data across all of those memory nodes for both reading and writing.
“We basically built a conventional cloud storage target with a ton of memory, with dozens of ports into this chip, and then stripe the transactions across all the memory,” Sankar tells The Next Platform. “Why did people say CXL is not useful for AI? Because they were looking at the bandwidth into one CXL port. But if you had a very fat, wide memory controller that could write across everything, your data is striped as wide as you want – and by the way, that’s one write from one initiator, if you had an application where this is effectively networked as one node – and you can use all of the ports to actually break up your write across them. Now you can use the entire bandwidth across multiple links for your write. You can get as much memory bandwidth – or throughput, I should say – as you want through that. Transferring a 100 gigabyte file can take very little time, depending on how many wires you use and how many memory channels you attach.”
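The striping Sankar describes boils down to round-robining one big write across every available memory channel so no single CXL port becomes the bottleneck. Here is a toy sketch of that idea; the chunk size, channel count, and payload are all illustrative, and real hardware does this in silicon, not Python:

```python
# Toy sketch of striping one large write across many memory channels.
def stripe(data: bytes, channels: int, chunk: int = 4096):
    """Round-robin the write across channels; each gets every Nth chunk."""
    lanes = [bytearray() for _ in range(channels)]
    for i in range(0, len(data), chunk):
        lanes[(i // chunk) % channels].extend(data[i:i + chunk])
    return lanes

payload = bytes(1_000_000)         # stand-in for a chunk of KV cache being flushed
lanes = stripe(payload, channels=18)
# Each of the 18 lanes carries roughly 1/18th of the write, so the transfer
# can run at the aggregate bandwidth of all channels at once.
```

The point is that aggregate throughput scales with the number of channels attached, which is exactly why judging CXL by the bandwidth of one port misses the picture.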
The Emfasys box can support 18 parallel memory channels today, but is going to 28 channels next year. (It is not clear if that is an upgrade of the number of ACF-S SuperNICs in the box or a change in the ACF-S silicon.)
Here is a more detailed block diagram that shows how the KV cache setup is wired together:
This shows four GPU servers linked to a single memory godbox, but obviously it can scale larger than that.
Sankar is not providing pricing on the Emfasys memory godbox, but says that CXL memory does not have much of a premium over fat DDR5 memory. The main thing is that the Emfasys memory box can be linked to any GPU or XPU host through PCI-Express MCIO links and then it all looks like extended memory hanging off the GPU host server, just like Grace does in a Grace-Hopper or Grace-Blackwell complex.
When pressed for some kind of price/performance benefit, Sankar said that adding a rack of Emfasys machines to a quad of Nvidia GB200 NVL72 rackscale systems would cut the cost per token on AI inference in half. Considering that memory servers are not going to be cheap, that tells you how under-utilized the GPUs must be because of their memory capacity limits. To cut the cost of tokens in half by adding what might be millions of dollars of memory servers to a quad of Nvidia rackscale machines means that the GPUs have to more than double their throughput by adding this memory.
That will, no doubt, raise a whole lot of eyebrows and get more than a few hyperscalers and cloud builders picking up the phone.