To a certain extent, the only thing that really matters in a computing system is what changes in its memory, and in that sense, memory is what makes computers like us. All of the computing capacity in the world, and all of the manipulating and transforming of data it does, matters less than the creation of new data that is then stored in memory so we can use it in some fashion, and at high speed.
The trouble with systems and their memory is that you can’t get a memory subsystem that has it all.
You can turn 3D XPoint into a kind of main memory, as Intel has shown with its Optane PMem DIMM form factor. The persistence of PMem is useful, but you end up with a memory that is more expensive than flash and slower than normal DRAM, so it cannot outright replace either; it can, however, be used as another layer in the memory hierarchy – and it is in certain systems and storage.
With plain vanilla DRAM, you can build a fat memory space for applications and data to frolic within, but it can get pricey and the bandwidth is not great. The increasing memory speeds and the larger number of memory controllers put onto CPUs help, but the latency remains relatively high (at least compared to HBM stacked memory) and the bandwidth is nowhere near as high as it is with HBM. The industry does not yet know how to make HBM in high volume, and therefore the yields are low and the unit costs are higher.
The ubiquity of DDR DIMMs – there have been five generations now – and their mass production means they are low cost even if they are bandwidth challenged. DDR SDRAM memory, which was specified in 1998 by JEDEC and widely commercialized in 2000, debuted at a low of 100 MHz and a high of 200 MHz and delivered between 1.6 GB/sec and 3.2 GB/sec of bandwidth per channel. Through the DDR generations, the memory clock rate, the I/O bus clock rate, and the data rate for the memory modules have all ramped, and so have the capacity and the bandwidth. With DDR4, still commonly used in servers, the top-end modules have memory running at 400 MHz, I/O bus rates of 1.6 GHz, 3.2 GT/sec data rates, and 25.6 GB/sec of bandwidth per module. DDR5 doubles the bandwidth to 51.2 GB/sec and doubles the maximum capacity per memory stick to 512 GB. The JEDEC spec for DDR5 allows for speeds up to 7.2 GT/sec, and we will see how that affects system design.
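If you want to check our math, the peak bandwidth of a DDR module is just its data rate multiplied by the 8-byte width of a standard 64-bit channel. Here is a minimal Python sketch of that arithmetic, using only the example data rates cited above; it is a back-of-envelope calculator, not a catalog of every DDR speed grade.

```python
# Peak DDR bandwidth per module: data rate (GT/sec) times the 8-byte (64-bit) channel width.
# The data rates below are just the examples cited in the text, not an exhaustive list.
DDR_EXAMPLES = {
    "DDR-400 (first generation)": 0.4,   # 0.4 GT/sec -> 3.2 GB/sec
    "DDR4-3200": 3.2,                    # 3.2 GT/sec -> 25.6 GB/sec
    "DDR5-6400": 6.4,                    # 6.4 GT/sec -> 51.2 GB/sec
    "DDR5-7200 (top of the spec)": 7.2,  # 7.2 GT/sec -> 57.6 GB/sec
}

def ddr_peak_bandwidth_gbs(data_rate_gt_per_sec: float, bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/sec for one DDR module or channel."""
    return data_rate_gt_per_sec * bus_bytes

for name, rate in DDR_EXAMPLES.items():
    print(f"{name}: {ddr_peak_bandwidth_gbs(rate):.1f} GB/sec per module")
```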
Our guess is that for many devices, this capacity is great but the bandwidth will simply not be enough. And so we will end up with a split memory hierarchy inside the node and close to the compute engine for the foreseeable future. Or, more precisely, customers will have to choose between devices that have DDR5 memory and those that have HBM3 memory; they may mix them within systems and across nodes in a cluster, and some of them may have Optane or some other kind of ReRAM or PCM persistent memory where appropriate.
Programming across main memory types and speeds is going to remain a problem for mixed memory systems until someone creates a memory processing unit and a memory hypervisor that can provide a single level memory space for compute engines to share – MemVerge, VMware, are you listening? (We need more than a memory hypervisor; we need something that accelerates it.)
Or, companies will use one type of memory to cache the other. Fast and skinny memory can cache the fat and slow memory, or vice versa. So in many hybrid CPU-GPU systems today, the GPU memory is where most of the processing is done, and the coherence across the DDR memory in the CPU and the HBM memory in the GPU is used mostly to have that DDR memory act as a giant L4 cache for the GPU – yes, the CPU has been relegated to being a data housekeeper. Conversely, with the Xeon SP systems that support Optane DIMMs, in one of the modes (the one that is easiest to program) the 3D XPoint memory is treated like slow main memory and the DDR4 or DDR5 DIMMs in the machine act as a super-fast cache for the Optane memory.
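The reason this caching approach works can be seen with a little average-access-time arithmetic: if the fast tier catches most of the accesses, the slow tier's latency barely shows. The sketch below is a toy model with illustrative hit rates and latencies we picked for the example, not measurements of any particular Optane or GPU system.

```python
# Toy model of a two-tier memory: a fast tier acting as a cache in front of a
# slower, fatter tier. All latencies and hit rates are illustrative assumptions.
def avg_access_ns(hit_rate: float, fast_ns: float, slow_ns: float) -> float:
    """Average access latency when the fast tier hits with probability hit_rate."""
    return hit_rate * fast_ns + (1.0 - hit_rate) * slow_ns

# Example: DDR DIMMs (~100 ns) fronting Optane PMem (~350 ns), Memory Mode style.
for hit_rate in (0.80, 0.95, 0.99):
    print(f"hit rate {hit_rate:.0%}: ~{avg_access_ns(hit_rate, 100, 350):.0f} ns average access")
```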
As we pointed out last July when previewing what HBM3 memory might mean to systems as it becomes available this year, we think that HBM memory – never put the noun of the thing you are creating in the abbreviation, because we can’t say High Bandwidth Memory memory, and what was wrong with HBRAM? – is going to be used in all kinds of systems and will eventually become more ubiquitous, and therefore less expensive. We don’t all still use core memory, after all, and so many workloads are constrained by memory bandwidth, not compute. That is why we believe there will be versions of HBM with skinnier 512-bit buses and no interposer as well as those that have the 1,024-bit bus and an interposer.
With HBM memory (and also the now defunct Hybrid Memory Cube stacked memory once created by Intel and Micron and used in Intel's Xeon Phi accelerators), you can stack up DRAM and link it to a very wide bus that sits very close to a compute engine, driving the bandwidth up by a factor of several, even up to an order of magnitude higher than what is seen with DRAM attached directly to CPUs. But that fast HBM memory is skinny, and it is inherently more expensive to make; still, the price/performance of the memory subsystem probably comes out better.
We don’t have a good sense of how much HBM costs compared to DDR main memory, but Frank Ferro, senior director of product marketing for IP cores at Rambus, knows what it costs compared to GDDR memory.
“The adder for GDDR5 versus HBM2 was about 4X,” Ferro tells The Next Platform. “And the reason is not just the DRAM chips but the cost of the interposer and the 2.5D manufacturing. But the good news with HBM is that you get the highest bandwidth, you get very good power and performance, and you get very small footprint area. You have to pay for all of that. But the HPC and hyperscale communities are not particularly cost constrained. They want lower power, of course, but for them, it’s all about the bandwidth.”
Nvidia knows of the benefits of HBM3 memory, and is the first one to bring it to market in the “Hopper” H100 GPU accelerator announced last month. That was pretty hot on the heels of JEDEC putting out the final HBM3 specification in January.
The HBM3 spec is coming in faster than SK Hynix was hinting last July with its early work, when it said to expect at least 5.2 Gb/sec signaling and at least 665 GB/sec of bandwidth per stack.
The HBM3 specification calls for the per-pin signaling rate to double to 6.4 Gb/sec from the 3.2 Gb/sec used with Samsung’s implementation of HBM2E, an extended form of HBM2 that pushed the technology beyond the official JEDEC spec, which set the signaling rate to 2 Gb/sec initially. (There was an earlier variant of HBM2E that used 2.5 Gb/sec signaling, and SK Hynix used 3.6 Gb/sec signaling to try to get an HBM2E edge over Samsung.)
The number of memory channels was also doubled from 8 channels with HBM2 to 16 with HBM3, and there is even support for 32 “pseudo channels” in the architecture, by which we presume there is some sort of interleaving possible across DRAM banks as was commonly done in high-end server main memories. The HBM2 and HBM2E variants could stack up DRAM 4, 8, or 12 chips high, and HBM3 is allowing for an extension to stacks of DRAM that are 16 chips high. DRAM capacities for HBM3 are expected to range from 8 Gb to 32 Gb, with a four-high stack using 8 Gb chips yielding 4 GB of capacity and a 16-high stack with 32 Gb chips yielding 64 GB per stack. First-generation devices using HBM3 memory are expected to be based on 16 Gb chips, according to JEDEC. The memory interface is still 1,024 bits wide and a single HBM3 stack can drive 819 GB/sec of bandwidth.
So with six stacks of HBM3, a device could, in theory, drive close to 4.9 TB/sec of bandwidth and 384 GB of capacity. We wonder what a Hopper H100 GPU accelerator with that much bandwidth and capacity would weigh in at in terms of cost and thermals. . . .
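For those who want to fiddle with the stack math themselves, here is a minimal Python sketch of it; the 1,024-bit interface, 6.4 Gb/sec signaling, 32 Gb dies, and 16-high stacks are the top-end figures from the JEDEC HBM3 spec described above, and nothing here reflects any particular vendor's implementation.

```python
# HBM stack arithmetic: bandwidth is interface width times per-pin signaling rate,
# capacity is die density times stack height. Figures are from the JEDEC HBM3 spec
# as described in the text.
def stack_bandwidth_gbs(pins: int, gbit_per_sec_per_pin: float) -> float:
    """Peak bandwidth of one HBM stack in GB/sec."""
    return pins * gbit_per_sec_per_pin / 8  # 8 bits per byte

def stack_capacity_gb(die_gbit: int, dies_per_stack: int) -> float:
    """Capacity of one HBM stack in GB."""
    return die_gbit * dies_per_stack / 8

bw = stack_bandwidth_gbs(pins=1024, gbit_per_sec_per_pin=6.4)  # ~819 GB/sec
cap = stack_capacity_gb(die_gbit=32, dies_per_stack=16)        # 64 GB

print(f"One HBM3 stack: {bw:.0f} GB/sec, {cap:.0f} GB")
print(f"Six HBM3 stacks: {6 * bw / 1000:.2f} TB/sec, {6 * cap:.0f} GB")
```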
Because the upper echelons of compute are impatient about memory bandwidth, Rambus is already pushing out beyond the relatively new HBM3 spec, with something that might eventually be called HBM3E in the chart above. Specifically, Rambus already has signaling circuits designed that can drive 8.4 Gb/sec signals on HBM3 pins and deliver 1,075 GB/sec – yes, more than 1 TB/sec – of bandwidth per HBM3 stack. Six of those stacks and you are up to roughly 6.45 TB/sec of memory bandwidth. This is possible through custom HBM3 memory controllers and custom HBM3 stack PHYs. (Rambus had signaling up to 4 Gb/sec on HBM2E, by the way.)
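Running the faster Rambus signaling rate through the same width-times-rate arithmetic bears those numbers out; the only input that changes is the 8.4 Gb/sec per-pin figure quoted above.

```python
# Same width-times-rate arithmetic, with the 8.4 Gb/sec per-pin rate Rambus quotes
# for its HBM3 controller and PHY.
pins, gbit_per_sec_per_pin = 1024, 8.4
per_stack_gbs = pins * gbit_per_sec_per_pin / 8  # 1,075.2 GB/sec per stack
print(f"{per_stack_gbs:.1f} GB/sec per stack, {6 * per_stack_gbs / 1000:.2f} TB/sec across six stacks")
```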
Such bandwidth might actually keep a compute device like the Nvidia Hopper GPU, or a future Google TPU5 machine learning matrix engine, or the device of your dreams, well fed with data. We shudder at the watts and the costs, though. But again, if bandwidth is the bottleneck, maybe it makes sense to invest more there and liquid cool everything.
Somebody please build such a beast so we can see how it performs and analyze its economics.
Your move, Samsung and SK Hynix.
MemVerge is listening! You need a single level memory space for compute engines to share. A Memory Lake!
Great article! I’m fairly new to the hardware world and partly reading NextPlatform to try to catch up to what people are talking about. What do the “X/sec signalling” metrics refer to?
Thanks for reading, Ben. It is the signaling frequency on each pin in the DRAM stack. The number of pins and their signaling rate is what drives the bandwidth; the density of the chips and the height of the stack drive the capacity. I want 1 TB of HBM at 10 Tb/sec. I also want a little red pony and a sailboat. And what the hell, a puppy, too. But HBM costs maybe 4X to 5X that of regular DRAM, so what I want is not necessarily possible until interposer manufacturing costs come way down.