If a few cores are good, then a lot of cores ought to be better. But when it comes to HPC this isn’t always the case, despite what the Top500 ranking – which is stacked with 64-core Epycs – would lead you to believe.
Speaking with executives at Atos and Lenovo’s HPC division, it becomes clear that while more cores are nice, it’s memory bandwidth, faster I/O, and higher clock speeds that customers are really after.
Yet AMD and Intel have been undeterred in their quest to push core counts ever higher with each subsequent generation. AMD’s “Genoa” Epyc 9004 boasts up to 96 cores and its upcoming “Bergamo” chips will boost the count to 128. Meanwhile, Intel’s “Sapphire Rapids” Xeon SPs top out at 60.
The reasons are obvious. Intel and AMD – and Ampere Computing for that matter – do big business selling high-core-count chips to hyperscalers and cloud providers for whom more cores equal more customers per node. The economics are rather simple and in that arena cores are king.
That’s not so much the case when it comes to mainstream HPC workloads. What’s more, chasing higher core counts introduces a couple of problems that are especially relevant for HPC customers, the most pressing of which is memory bandwidth.
Out Of Balance
“A lot of our HPC users that are not as core sensitive – they don’t care as much about cores,” Scott Tease, Lenovo’s vice president of HPC and AI, tells The Next Platform. “What they do care about is the higher memory bandwidth.”
This shouldn’t come as a surprise to many. More cores also means spreading what little memory bandwidth is available ever thinner. And while both AMD and Intel benefit from faster DDR5 memory this generation, boosting bandwidth by about 50 percent over DDR4, that doesn’t move the needle much when chipmakers also increased the number of cores by the same margin. To account for this, Intel and AMD have taken very different approaches.
Let’s start with Intel, which has arguably taken the more interesting route by putting 64 GB of HBM2e stacked memory adjacent to its Sapphire Rapids Xeon Max CPU dies. This works out to more than 1 TB/sec of memory bandwidth, more than three times that of the Xeon SPs with normal DDR5 memory channels. On its 56-core Max Series CPU part, that works out to about 18 GB/sec of bandwidth per core. For comparison, AMD’s top-specced part may have 40 more cores, but it only manages to deliver 4.8 GB/sec of bandwidth to each core.
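The per-core figures above are simple division, and a minimal sketch makes the comparison easy to rerun with other parts. The 1,000 GB/sec value for Xeon Max is our rounding of "more than 1 TB/sec," and the 460 GB/sec value is the Genoa peak quoted later in this piece; the function name is purely illustrative.

```python
def bandwidth_per_core(total_gb_per_sec: float, cores: int) -> float:
    """Peak memory bandwidth divided evenly across cores, in GB/sec."""
    return total_gb_per_sec / cores

# Assumption: "more than 1 TB/sec" is rounded down to 1,000 GB/sec.
xeon_max_per_core = bandwidth_per_core(1000, 56)

# 96-core Genoa with the 460 GB/sec peak quoted in the article.
genoa_per_core = bandwidth_per_core(460, 96)

print(f"Xeon Max (56 cores): {xeon_max_per_core:.1f} GB/sec per core")
print(f"Genoa (96 cores):    {genoa_per_core:.1f} GB/sec per core")
```

These are theoretical peaks shared evenly across all cores; sustained bandwidth on real workloads will be lower.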
“That’s quite a substantial increase in bandwidth,” Jean-Pierre Panziera, chief technology officer of HPC at Atos, tells The Next Platform of Xeon Max. “For applications that are bandwidth sensitive – for example, a lot of the computational fluid dynamics workloads for climate for weather forecasting – this could bring a lot of improvement.”
Of course, there’s a catch: 64 GB isn’t a whole lot of memory to work with – between 1.14 GB and 2 GB per core depending on which Xeon Max part you’re looking at. You can expand that using DDR5, but then you’re dropping down to memory that’s about a third as fast and relying on the chip’s firmware or the ISV integration to handle data movement.
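The capacity range works out the same way. In this rough sketch, the 56-core count comes from the part discussed above, while the 32-core count is our assumption for the smallest HBM-equipped part, inferred from the 2 GB per core figure.

```python
HBM_CAPACITY_GB = 64  # HBM2e capacity per socket, per the article

def hbm_per_core(cores: int, capacity_gb: float = HBM_CAPACITY_GB) -> float:
    """HBM capacity available to each core if shared evenly, in GB."""
    return capacity_gb / cores

# 56 cores is the top part discussed above; 32 cores is an assumption
# inferred from the 2 GB per core upper bound.
for cores in (56, 32):
    print(f"{cores} cores: {hbm_per_core(cores):.2f} GB of HBM per core")
```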
However, for some workloads this may be more than enough, notes Tease. “I think we may find that a lot of the reasons that people have adopted a GPU, maybe some of those workloads can easily shift back to a CPU if it had HBM on it.”
AMD, on the other hand, has stuck with the tried-and-true approach of adding more memory channels at the expense of board space and ever so slightly higher latencies. The Genoa chip now boasts 12 memory channels across the product stack, which, combined with DDR5 memory’s higher transfer rate, works out to 460 GB/sec of bandwidth – more than twice that offered by “Milan” Epyc 7003s.
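For readers who want to see where the 460 GB/sec figure comes from, here is a back-of-the-envelope sketch. It assumes DDR5-4800 on Genoa, eight channels of DDR4-3200 on Milan, and a 64-bit (8-byte) data path per channel; the article gives only the channel count and the totals, so treat those inputs as our assumptions.

```python
def peak_bandwidth_gb_per_sec(channels: int, mega_transfers_per_sec: int,
                              bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/sec.

    Each channel moves bytes_per_transfer bytes per transfer, so the
    product is in MB/sec; dividing by 1,000 converts to GB/sec.
    """
    return channels * mega_transfers_per_sec * bytes_per_transfer / 1000

# Assumptions: DDR5-4800 for Genoa; 8 channels of DDR4-3200 for Milan.
genoa = peak_bandwidth_gb_per_sec(channels=12, mega_transfers_per_sec=4800)
milan = peak_bandwidth_gb_per_sec(channels=8, mega_transfers_per_sec=3200)

print(f"Genoa: {genoa:.1f} GB/sec")
print(f"Milan: {milan:.1f} GB/sec")
print(f"Ratio: {genoa / milan:.2f}x")
```

Under those assumptions, the gain comes from both levers at once: 50 percent more channels and a 50 percent faster transfer rate.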
While it won’t get you anywhere near Intel’s Max Series in terms of sheer bandwidth, it makes a big difference, especially on AMD’s lower core-count parts. And because AMD is using regular old DDR5, they’re not stuck relying on software tie-ins for memory tiering as is the case with Intel’s Xeon Max.
When it comes to balancing cores and memory bandwidth, Intel has a “clear advantage,” according to Tease. “I would like to see HBM more widely promoted in the industry,” he said. This is an attitude we share, particularly for HPC and AI workloads, along with some industry luminaries.
AMD’s X Factor
It is hard to talk about Max Series CPUs without drawing parallels to AMD’s X-series parts, which, rather than adding HBM, layer SRAM atop the CPU dies to bolster the chip’s L3 cache.
The technology debuted with the launch of Milan-X during AMD’s Accelerated Data Center virtual event in late November. Using a technique called 3D V-Cache – named for the advanced packaging technique used to layer additional SRAM atop the chip’s Core Complex Dies (CCDs) – AMD was able to tack an additional 64 MB of L3 cache onto each die for a total of 96 MB each. On its top-specced chip, that worked out to 768 MB of L3 cache.
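The cache totals can be checked with a quick sketch. The 32 MB of base L3 per CCD is implied by the 96 MB total minus the 64 MB stack, and the eight-CCD count for the top-specced chip is our inference from 768 MB divided by 96 MB; neither number is stated outright above.

```python
STACKED_L3_MB = 64   # extra SRAM layered on each CCD, per the article
TOTAL_L3_MB = 96     # resulting L3 per CCD, per the article

# Implied base L3 per CCD before stacking.
base_l3_mb = TOTAL_L3_MB - STACKED_L3_MB

# Assumption: eight CCDs on the top part, inferred from 768 / 96.
ccds = 8
chip_l3_mb = ccds * TOTAL_L3_MB

print(f"Base L3 per CCD: {base_l3_mb} MB")
print(f"L3 per chip:     {chip_l3_mb} MB")
```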
By caching more of the workload in L3, AMD claimed it could increase the throughput in bandwidth intensive workloads by a significant margin – 66.4 percent in the case of a Synopsys VCS test.
However, neither Tease nor Panziera is sold on 3D V-Cache just yet, and price-to-performance appears to be a primary concern.
“What we’ve seen so far from all the applications and benchmarking – and here I’m just talking about HPC – is the increase in performance is not matched by the increase in price,” Panziera said. “The cache – it’s bringing something to the table, not so much for HPC.”
For Tease, HBM promises to be a more flexible medium. “There are a lot of workloads, like EDA or CFD-type workloads, that it would be great if they could fit in the cache. But if they’re bigger than the cache you’ve still got to go out to main memory and it slows things down quite a bit,” he said. “HBM has a higher likelihood to be able to fit that code inside the HBM and really take advantage of that much higher access rate.”
AMD has yet to share details on Genoa-X, but we can surmise it will offer even more cache than the outgoing Milan-X part.
Higher Frequencies, Please
The status quo for the past ten to fifteen years has been that CPU base clocks have been stuck in the 2 GHz to 3 GHz range, Panziera said.
The reason shouldn’t surprise anyone here: The more cores you pack into the chip the less power you’ve got to drive clocks. As a result, base clocks have largely languished.
Intel’s 4th-Gen Xeon SPs, announced in January after months of delays, are no exception. Only a handful of chips can manage better than a 3 GHz base frequency. They don’t boost that high either, with most topping out at less than 4.1 GHz in Turbo mode when a lot of the cores are idle.
That’s an improvement to be sure, but boosted clocks are difficult to predict since they’re dictated by complex algorithms that weigh temperatures, power budget, core loading, and even the instruction set to determine how high each core can boost. As such, we can only really look at base clocks with any degree of certainty, as the only way the chip will drop below them is if it’s thermally compromised. And in that case, you’ve got bigger problems to worry about.
The same has largely been true of AMD until recently. However, with the launch of AMD’s Zen 4 architecture, a move to TSMC’s more efficient 5nm manufacturing process, and a higher overall TDP, Genoa has fared better. The chipmaker has several performance (F) SKUs capable of more than 4 GHz. The most obvious consequence is a higher thermal output.
We already know Zen 4 is capable of pushing even higher clocks. This is the same architecture used in AMD’s consumer-focused Ryzen processors. The chipmaker’s 16-core 7950X has a 4.5 GHz base clock and is capable of boosting to 5.7 GHz on around 230 watts – albeit not on all cores.
“We’re kind of swimming in cores,” said Tease, who would much rather see substantially higher core clocks on HPC-focused parts from chipmakers. “We’re seeing very, very few customers that are really looking for 96 cores or 128 cores. It’d be much nicer to have an 8-core or 16-core part at a 4-plus GHz kind of frequency.”
APUs On The Horizon
GPUs, not CPUs, have become the dominant driver of performance gains in recent years, at least for accelerated workloads.
“The CPU’s role in server usage or server design has kind of changed from the thing that most of the performance is being run on. In many cases now, it’s basically the traffic cop,” Tease said.
So, it shouldn’t be surprising that chipmakers should try to combine the best of both worlds, melding CPU with GPU. This of course isn’t a new idea. It’s just been relegated to consumer hardware like thin and light notebooks until now.
AMD recently offered a peek at its upcoming Instinct MI300 accelerated processing unit (APU), which will make judicious use of the chipmaker’s chiplet and 3D-packaging techniques and which we detailed here. The MI300A features nine 5 nanometer and four 6 nanometer chiplets, which themselves will be flanked by two banks of HBM, at least if AMD’s rendering is anything to go by. (We think it is.)
Intel’s “Falcon Shores” CPU-GPU hybrid, previewed last February, will follow a similar trajectory, combining its X86 CPU cores with the Xe graphics cores at the heart of its impending “Ponte Vecchio” GPUs and upcoming “Rialto Bridge” GPUs. The details on Falcon Shores are a bit thin still, but we know that, like the AMD MI300 series, Falcon Shores will include CPU cores and GPU cores with a shared pool of “extreme bandwidth” memory, which could be HBM3 stacked memory or something else. The XPU, as Intel prefers to call this device, will also be manufactured using the chipmaker’s angstrom-era manufacturing tech.
And while not quite the same, Nvidia’s Grace-Hopper superchip is more akin to AMD’s MI300 or Intel’s Falcon Shores than it is different. More importantly, the design – which packages a Grace CPU die with 512 GB of LPDDR5X memory alongside a Hopper GPU die with 80 GB of HBM, interconnected by high-speed NVLink ports – will beat both AMD and Intel to market.
However, this transition introduces challenges for OEMs on a couple of fronts, thermal management chief among them. Today CPUs are consuming north of 400W while GPUs are pushing 600W. “I would expect some of these APUs to come in at over a kilowatt,” Tease said.
At that point, liquid cooling isn’t just a nice to have but rather a requirement.
Another challenge is software support. While Intel and Nvidia have a long history of software development in support of their chips, the same can’t be said of AMD. “We like the looks of the MI300 and the roadmap, but the software ecosystem is still a problem,” said Tease. “It’s still not turnkey and easy for run-rate customers.”
Panziera expressed similar concerns about the maturity of software developed for Arm processors, like Nvidia’s Grace. “You may run into a situation where you may have 70 percent of the applications that are up and running on this platform, but the 30 percent you might be missing will force you to go back to X86 for example.”
We have no doubt that these CPU-GPU hybrids will address some existing HPC bottlenecks, but probably not all of them, and we do not yet know how helpful they will be for AI training and data analytics workloads. But we will keep an eye out.