What good is a floating point operation embodied in a vector or matrix unit if you can’t get data into it fast enough to actually keep the compute engine busy every clock cycle? The answer is obvious to all of us: Not much.
People have been talking about the imbalance between compute and memory bandwidth for decades, and every year the high performance computing industry has been forced to accept less and less memory bandwidth per floating point operation, because increasing memory bandwidth in a fashion that doesn’t also end up being very pricey is exceedingly difficult.
And while we are thinking about it, increasing memory capacity is also getting more difficult because fat memory is coming under the same Moore’s Law pressures: making memory that is both increasingly dense and increasingly fast is getting harder and harder, and hence the price of memory has not come down as much as it otherwise would have. And thus, we do not have the kind of massive memory machines that we might have dreamed of decades ago.
We were reminded of this acutely during the Turing Award keynote by Jack Dongarra, well known to readers of The Next Platform as a distinguished researcher at Oak Ridge National Laboratory and research professor emeritus at the University of Tennessee. Like many of you, we watched the Turing Award keynote that Dongarra gave, talking about how he unexpectedly got into the supercomputing business and became the expert on measuring system performance on these massive machines – mostly by being part of the teams that constantly evolved math libraries as supercomputer architectures changed every decade or so. If you haven’t watched the keynote, you should, and you can do so at this link. The history is fascinating, and it forecasts how we will continue to evolve software as architectures continue to evolve.
But that’s what we are not going to talk about here.
What stuck out in our mind as we were watching Dongarra’s keynote is the massive overprovisioning of flops in today’s processors relative to memory bandwidth, and it was resonating in our head because that same week Intel had just announced some benchmark results on its upcoming “Sapphire Rapids” Xeon SP server CPUs, showing the benefit of HBM2e stacked memory, which has roughly 4X the memory bandwidth of plain vanilla DDR5 memory sticks used in modern server CPUs. (Sapphire Rapids has a 64 GB HBM2e memory option, and can be used in conjunction with DDR5 memory or instead of it.) The benefit of the HBM2e high bandwidth memory shows how much out of whack flops and bandwidth are:
The addition of HBM2e memory to the Sapphire Rapids CPU does not affect Dongarra’s beloved High Performance Linpack (HPL) matrix math test very much, as you can see, and that is because HPL is not particularly memory bound. But the High Performance Conjugate Gradients (HPCG) and Stream Triad benchmarks, both of which are memory bound like crazy, sure do see a performance boost just by switching memory. (We presume that the machines tested had a pair of top bin, 60-core Sapphire Rapids chips.) Under normal circumstances with the HPCG test, which is probably the most accurate reflection of how some very tough HPC applications really are written (by necessity, not by choice), the world’s fastest supercomputers use anywhere from 1 percent to 5 percent of the machine’s potential flops. So increasing this by a factor of 3.8X would be a very, very big improvement indeed if that performance can scale across thousands of nodes. (This remains to be seen, and HPCG is the test that will – or won’t – show it.)
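For the record, the Triad kernel at the heart of the Stream benchmark is tiny, and its arithmetic intensity shows why swapping in faster memory moves the needle so much. Here is a minimal sketch in Python (the array size is illustrative):

```python
# Sketch of the Stream Triad kernel: one multiply and one add per element,
# against three 8-byte memory operations (two loads, one store).

def triad(a, b, c, scalar):
    # a[i] = b[i] + scalar * c[i] for every element
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]
    return a

n = 100_000
b = [1.0] * n
c = [2.0] * n
a = [0.0] * n
triad(a, b, c, 3.0)

flops = 2 * n            # one multiply + one add per element
bytes_moved = 3 * 8 * n  # two loads and one store of 64-bit floats
intensity = flops / bytes_moved
print(intensity)         # ≈ 0.083 flops per byte
```

At roughly 0.083 flops per byte, Triad exhausts memory bandwidth long before it exhausts the floating point units, which is why it responds to an HBM2e swap while HPL barely notices.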
So just how far out of whack are flops and memory bandwidth with each other? Dongarra showed how it is getting worse with each passing architectural revolution in supercomputing:
And here is a zoom into the chart that Dongarra showed:
“When we look at performance today on our machines, the data movement is the thing that’s the killer,” Dongarra explained. “We’re looking at the floating point execution rate divided by the data movement rate, and we’re looking at different processors. In the old days, we had processors that basically had a match of one flops per one data movement – that’s how they were balanced. And if you guys remember the old Cray-1s, you could do two floating point operations and three data movements all simultaneously. So this is trying to get a handle on that. But over time, the processors have changed the balance. What happened over the course of the next twenty years from that beginning is that an order of magnitude was lost. That is, we can now do ten floating point operations for every data movement that we make. And more recently, we’ve seen that number grow to 100 floating point operations for every data movement. And even some machines today are in the 200 range. That says there’s a tremendous imbalance between the floating point and data movement. So we have tremendous floating point capability – we are overprovisioned for floating point – but we don’t have the mechanism for moving data very effectively around in our system.”
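The balance ratio Dongarra describes is easy to compute for any machine if you know its peak flops and its memory bandwidth. A quick sketch, where the Cray-1 figures follow the usual one-word-per-cycle accounting and the modern node’s numbers are illustrative assumptions, not measurements of any specific machine:

```python
# Dongarra's balance metric: peak floating point rate divided by the rate
# at which 64-bit words can be moved to and from memory.

def flops_per_word(peak_flops, mem_bandwidth_bytes_per_s):
    words_per_s = mem_bandwidth_bytes_per_s / 8  # 8 bytes per FP64 word
    return peak_flops / words_per_s

# Cray-1-style balance: ~160 Mflops against ~80 Mwords/s of bandwidth
print(flops_per_word(160e6, 80e6 * 8))  # → 2.0 flops per word

# A hypothetical modern CPU node: 5 Tflops FP64 peak, 300 GB/s of DDR5
print(flops_per_word(5e12, 300e9))      # → ~133 flops per word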
The chart shows how, generation by generation, it has gotten worse and worse. And moving to HBM2e and even HBM3 or HBM4 and HBM5 memory is only a start, we think. And CXL memory can only partially address the issue. Inasmuch as CXL memory is faster than flash, we love it as a tool for system architects. But there are only so many PCI-Express lanes in the system to do CXL memory capacity and memory bandwidth expansion inside of a node. And while shared memory is interesting and possibly quite useful for HPC simulation and modeling as well as AI training workloads – again, because it will be higher performing than flash storage – that doesn’t mean any of this will be affordable.
We don’t yet know what even the HBM2e memory option on Sapphire Rapids will cost. If it gooses memory bound applications by 4X to 5X but the CPU costs 3X more, that is not really a gain on the performance per dollar per watt front that gates architectural choices.
The HBM2e memory option on the future Xeon SP is a good step in the right direction. But maybe having a lot more SRAM in L1, L2, and L3 caches is more important than adding cores if we want to get the memory back in balance.
Having won the Turing Award gives Dongarra a chance to lecture the industry a bit, and once encouraged to do so, he thankfully did. And we quote him at length because when Dongarra speaks, people should listen.
“I have harped on the imbalance of the machines,” Dongarra said. “So today, we build our machines based on commodity off the shelf processors from AMD or Intel, commodity off the shelf accelerators, commodity off the shelf interconnects – those are commodity stuff. We’re not designing our hardware to the specifics of the applications that are going to be used to drive them. So perhaps we should step back and have a closer look at how the architecture should interact with the applications, with the software – co-design, something we talk about, but the reality is very little co-design takes place today with our hardware. And you can see from those numbers, there’s very little that goes on. And perhaps a better indicator is what’s happening in Japan, where they have much closer interactions with the architects, with the hardware people, to design machines that have a better balance. So if I was going to look at forward looking research projects, I would say maybe we should spin up projects that look at architecture and have the architecture better reflected in the applications. But I would say that we should have a better balance between the hardware and the applications and the software – really engage in co-design. Have spin-off projects which look at hardware. You know, in the old days, when I was going to school, we had universities that were developing architectures, that would put together machines. Illinois was a good example of that – Stanford, MIT, CMU. Other places spun up and had hardware projects that were investigating architectures. We don’t see that as much today. Maybe we should think about investing there, putting some research money – perhaps from the Department of Energy – into that mechanism for doing that kind of work.”
We agree wholeheartedly on the hardware-software co-design, and we believe that architectures should reflect the software that is running on them. Frankly, if an exascale machine costs $500 million but you can only use 5 percent of the flops to do real work, you are effectively buying a 50 petaflops machine – which, if you look at the price/performance, is the same as paying $10 billion for an exaflops machine running at 100 percent utilization. To do it the way Dongarra is suggesting would make all supercomputers more unique and less general purpose, and also more expensive. But there is a place where the performance per watt, cost per flops, performance per memory bandwidth, and cost per memory bandwidth all line up better than we are seeing today with tests like HPCG. We have to get these HPC and AI architectures back in whack.
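That back-of-the-envelope arithmetic works out like this, assuming a 1 exaflops peak machine (the figures are the illustrative ones from the paragraph above):

```python
# Price/performance of an exascale machine at HPCG-like utilization.
# All figures are illustrative, per the discussion above.
peak_flops = 1e18   # a 1 exaflops peak machine
price_usd = 500e6   # $500 million
utilization = 0.05  # 5 percent of peak sustained on real work

useful_flops = peak_flops * utilization       # 5e16, i.e. 50 petaflops
usd_per_useful_exaflops = price_usd / utilization  # $10 billion
print(useful_flops, usd_per_useful_exaflops)
```

In other words, at 5 percent utilization you pay exascale money for double-digit petascale work, which is the imbalance in dollar terms.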
The next generation of researchers, inspired by Dongarra and his peers, need to tackle this memory bandwidth problem and not sweep it under the rug. Or, better still for a metaphorical image – stop rolling it up in a carpet like a Mob hit and driving it out to the Meadowlands in the trunk of a Lincoln. A divergence of 100X or 200X is, in fact, a performance and an economic crime.