If large language models are the foundation of a new programming model, as Nvidia and many others believe they are, then the hybrid CPU-GPU compute engine is the new general purpose computing platform.
After years and years of having others build these hybrid machines based on its high-end SXM streaming multiprocessor modules or its PCI-Express cards, Nvidia is taking the AI infrastructure bull by the horns with its latest GH200 superchips, which combine a “Grace” CPU based on the Arm architecture with a “Hopper” GPU based on the sixth true generation of its GPU compute engines, and is architecting a complete system suitable for running the largest AI training and inference workloads.
While Nvidia has been shipping its Grace-Hopper superchips in volume since May and has standardized server designs based on HGX form factors for Hopper-only GPU boards and MGX form factors that are used for a mix of Grace and/or Hopper chips, the one you will probably want to get your hands on is the second generation GH200 superchip, which was recently announced at the Siggraph 2023 conference while The Next Platform was on vacation. (We are playing catch up in our analysis.)
With this second generation GH200 superchip, the Grace CPU and the Hopper GPU chips themselves are exactly the same as in the first generation, with the Grace chip being based on the “Demeter” Neoverse V2 cores from Arm Ltd, which Nvidia tried to buy for $40 billion three years ago and which has just filed to go public – again. The original Hopper SXM5 GPU compute engine had 80 GB of HBM3 memory with 3.35 TB/sec of memory bandwidth. Last year, when the Hopper GPUs were launched, this SXM5 device had six stacks of HBM3 memory, but only five of them were active (we suspect for yield reasons), so it only had 80 GB of capacity instead of the 96 GB you would expect. The PCI-Express version of Hopper had the same five out of six HBM3 memory stacks working, and it only delivered 2 TB/sec of memory bandwidth (due presumably to a lower clock speed that burned a lot less juice and created a lot less heat).
When the Grace-Hopper SXM5 superchip – formally known as the GH200 – was launched, Nvidia was able to fire up all six stacks and get 96 GB of memory and 4 TB/sec of memory bandwidth out of the Hopper GPU. With the second generation GH200 superchip, Nvidia is moving the Hopper part of the compute complex to HBM3e memory and it is able to boost the capacity to 141 GB and the bandwidth to 5 TB/sec. That is a 76.3 percent increase in memory capacity and a 49.3 percent increase in memory bandwidth compared to the original Hopper SXM5 device announced last year.
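For those keeping score, the stack and percentage arithmetic above is easy to check. Here is a minimal sketch in plain Python using the capacities and bandwidths just cited; the 16 GB per HBM3 stack is our inference from the five-of-six stack configuration, not an Nvidia-published figure:

```python
# Back-of-the-envelope check on the Hopper memory figures cited above.
# Original Hopper SXM5: six HBM3 stacks, five active, implying 16 GB per stack.
stacks_total, stacks_active, gb_per_stack = 6, 5, 16
print(stacks_active * gb_per_stack, "GB shipped vs", stacks_total * gb_per_stack, "GB possible")  # 80 vs 96

# Second generation GH200: HBM3e lifts capacity to 141 GB and bandwidth to 5 TB/sec.
old_gb, new_gb = 80, 141
old_tbs, new_tbs = 3.35, 5.0
print(f"Capacity gain:  {(new_gb / old_gb - 1) * 100:.2f} percent")    # 76.25, i.e. the 76.3 percent cited above
print(f"Bandwidth gain: {(new_tbs / old_tbs - 1) * 100:.2f} percent")  # 49.25, i.e. the 49.3 percent cited above
```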
Ian Buck, general manager of hyperscale and HPC at Nvidia, tells The Next Platform that the memory upgrade is enabled by a shift from HBM3 to HBM3e and that the Hopper GPU was designed from the get-go to support the faster and denser HBM3e memory so it could intersect the memory technology transitions and take advantage of them. We strongly suspect that this capacity and bandwidth boost does not come from an increase in the number of memory chips in an HBM stack, which implies denser memory chips, and unless the Hopper SXM5 package was completely redesigned, there are no more than six stacks, either.
Buck would not confirm who the HBM3e memory supplier was for the second-gen GH200 superchip, but Samsung, SK Hynix, and Micron Technology all have HBM3e memory in the works and any of them could, in theory, be the supplier since this revamped superchip is not sampling until the end of the year and is not shipping until the second quarter next year. Buck did say that Nvidia has multiple suppliers for HBM memory, and we think this is wise given the high cost and difficulty of manufacturing this stuff compared to normal DRAM. For all we know, to boost manufacturing output, all three vendors are supplying HBM3e memory to Nvidia for its GPU engines.
When we asked about when the regular Hopper SXM5 and Hopper PCI-Express cards might see an HBM3e upgrade, Buck was mum on the subject. And when we suggested that perhaps the LPDDR5X memory in the Grace CPU could get an upgrade, too, to increase its memory capacity and possibly its bandwidth, Buck gave us the standard line about not talking about unannounced products.
Nvidia has not given pricing on any of the Hopper or Grace-Hopper compute engines, and is similarly not talking about whether this increased memory capacity and memory performance is being given away for free. (We strongly suspect not.) Buck did say that Nvidia expects that system builders buying Grace-Hopper superchips will “fairly quickly” move to this second generation.
For HPC and AI applications that are memory capacity or memory bandwidth bound, the HBM3e memory will be a boon for performance.
“I expect for bandwidth limited applications to get near that 1.5X increase,” says Buck about the next-gen GH200. “We won’t always be able to push that level, of course, but it is going to be in that ballpark. What’s interesting also is the capacity increase because you can fit a larger model on a single GPU, and now with the CPU-GPU combined you effectively have almost 700 gigabytes of combined memory, so you can do more with a single GPU. It will have increased performance, but you won’t necessarily require two GPUs to run a larger model.”
That’s the first announcement from Siggraph. The second one is that Nvidia has come up with a dual-socket Grace-Hopper configuration that uses 900 GB/sec direct NVLink ports to connect two superchips together into a shared memory complex with two Grace CPUs and two Hopper GPUs. In fact, it is a four-way link between the devices, just like we see in four-way CPU systems, so each device can reach out and talk to the memory of any of the other devices in the complex. Call it asymmetrical NUMA, if you will.
“It basically changes these two processors – there is one giant GPU, one giant CPU – into a super-sized superchip,” Jensen Huang, Nvidia co-founder and chief executive officer, explained in his keynote at Siggraph. “The CPU now has 144 cores, the GPU has 10 TB/sec of frame buffer bandwidth and 282 GB of HBM3e memory. Well, pretty much you can take any large language model you like and put it on this and it will inference like crazy. And the inference cost of large language models will drop significantly because look at how small this computer is. And you can scale this out in the world’s datacenters – you can connect it with Ethernet, you can connect it with InfiniBand.”
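Huang’s headline numbers for the dual configuration are just two of the second generation GH200 superchips added together: two Grace CPUs with 72 Arm cores apiece and two Hoppers with the HBM3e capacity and bandwidth cited above. A quick sketch of that arithmetic in Python:

```python
# The dual Grace-Hopper configuration simply doubles a single second-gen GH200.
grace_cores = 72     # Arm Neoverse V2 cores per Grace CPU
hbm3e_gb = 141       # HBM3e capacity per Hopper GPU, in GB
hbm3e_tbs = 5.0      # HBM3e bandwidth per Hopper GPU, in TB/sec

superchips = 2
print("CPU cores:      ", superchips * grace_cores)          # 144 cores
print("HBM3e capacity: ", superchips * hbm3e_gb, "GB")       # 282 GB
print("HBM3e bandwidth:", superchips * hbm3e_tbs, "TB/sec")  # 10.0 TB/sec
```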
That bit about lowering the cost of inference is key because when you have to do inference on the same class of machine you do training on, it is very expensive unless you can drive the cost of that training iron down to where you wanted the cost of inference to be. It remains to be seen how much cheaper the Grace-Hopper approach will be compared to building machines that look like the DGX H100 servers that Nvidia makes, which are based on eight-way GPU complexes using Hopper SXM5 units, all interlinked with NVSwitch fabrics.
It really comes down to how much less expensive a Grace CPU is compared to buying an X86 host based on Intel or AMD processors. The latter will be more expensive, we think, but will also allow for a much larger CPU memory space and flash storage footprint, as the DGX H100 servers and their HGX clones do. We like the idea of making MGX Grace-Hopper clusters using NVSwitch fabrics to interconnect up to 256 GPUs and then using InfiniBand to cross couple multiple pods together into a superpod. And it would be fascinating to see how a Grace-Hopper superpod with 256 GPUs performs against a Hopper-based DGX H100 superpod that uses Intel “Sapphire Rapids” Xeon SPs on the CPU hosts, has the same 256 Hopper GPUs, and has NVSwitch inside of the fatter nodes. Given the higher memory bandwidth and memory capacity on the second-gen Grace-Hopper GH200s compared to the original Hopper SXM5, it is not hard to figure out which would win.
And if and when the Hopper SXM5 is moved to HBM3e stacks with 141 GB of capacity and 5 TB/sec of bandwidth, we suppose it will all come down to the nature of the AI training and inference workloads and how they react to a hierarchy of memory and networking.
For the H200, it would have been great to discuss if you think it is a reaction to the AMD MI300 having 5TB/s bandwidth and 192GB of HBM or if it looks like the normal upgrade cadence.
I feel like having the H200 announced just after H100 becoming available is a bit of an annoying move for everyone who ordered H100s.
I would have thought my Twit, or X, or whatever conveyed that. But yes, I think it absolutely is a reaction to higher capacity GPUs and NNPs. Which I should have been more explicit about. Nvidia can’t be hanging back on the memory capacity. And yet, who knows when the actual freestanding H100s get the memory upgrade? For now, only Grace-Hopper H100s are getting it.
A mid-generation increase in memory capacity happened for the V100 (16 GB to 32 GB) and the A100 (40 GB to 80 GB). It’s already been established as the standard productization pattern. Memory bandwidths were also increased in the previous mid-generation refreshes. The reason for the switch from HBM3 to HBM3e this time around is that the technology was not available for the original SKUs but will be available in volume in time for the refreshed SKU. As mentioned in this article, Nvidia planned from the design of the original GH100 for a mid-generation upgrade to HBM3e. That decision would have been made at least 2 years ago, likely 3 or more. So there’s no reason to invoke the MI300 for anything to do with this announced mid-generation refresh.
If anything, the fact that this was only announced so far for the GH200 and not the standard GH100 suggests Nvidia is adopting a strategy springing from a sense of security in its market position rather than one that is concerned with the threat of the MI300 as competition.
I think the memory upgrade was slated for GTC 2024 and got pulled in.
Yes, the timing of the announcement may have been influenced by the MI300, but the existence of the product and the timing of its release is likely unrelated.
Agreed.
And remember, on a two-year cadence, we expect a kicker to the Hopper next year. And that will be using HBM3e and will be able to do whatever HBM4 is, too.
Can’t wait to see those engines, and MI300A, in action (and compare them also to SR+PV)! Hopefully the CoWoS packaging bottleneck eases-up and they all become available in higher volumes, and lower price points!
Looking back at TNP’s “Charm of AMD” and “Nvidia Embraces Grace” pieces (and nV blog on nvidia-hopper-architecture-in-depth), it seems that MI300A wins out on FP64 (from 9 joules per teraflop in scalar mode, down to just 1 or 2 J/TF for matrix). GH200 becomes more competitive in the lower precision modes, say 0.6-1.2 J/TF in FP32 tensor (vs 1.1-2.1 for MI300A FP32 matrix).
Their matrix/tensor feeding abilities likely differ as well (and are key, as stressed by TNP) and this GH200 HBM update, to 141 GB, should help it along (MI300A has 128 GB). Then again, MI300A has its HBM in 8 stacks, to GH200’s 6 stacks. Conversely, GH200 has 72 good ARM CPUs, to MI300A’s 24 good x64 CPUs … and anti-conversely, MI300A’s memory is fully shared between CPUs and GPUs, vs split between them on GH200 … tough call!
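For anyone reproducing such joules-per-teraflop estimates, they come from dividing module power by sustained teraflops, since a watt is a joule per second; a minimal sketch with purely illustrative numbers rather than published specs for either part:

```python
# Joules per teraflop = watts / (teraflops per second), since 1 W = 1 J/sec.
# The power and throughput values below are illustrative placeholders only,
# not published specs for the MI300A or the GH200.
def joules_per_teraflop(power_watts: float, teraflops_per_sec: float) -> float:
    return power_watts / teraflops_per_sec

print(joules_per_teraflop(700, 70))   # a hypothetical 700 W module at 70 TF/sec -> 10.0 J/TF
print(joules_per_teraflop(700, 700))  # the same module at 700 TF/sec in a denser mode -> 1.0 J/TF
```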
I guess the only way to know is to see how they test out at the system level, with pricing and thermals added to the mix. It is too close for me to call, and I think it would be funny if the resulting machines have about the same thermals at about the same cost for the same performance. THEN we would truly have at least two ways to skin the same AI cat.
… and with cats being so hard to herd (prior to the skinning part) … but I’d almost guess (at equal power consumption) MI300A winning HPL and HPCG types of contests (more FP64 perf. and HBM stacks), and GH200 winning HPL-MxP and GRAPH500 contests (more low-prec. perf. and more CPUs). Each would win Green500 in those corresponding categories (though it is commonly calculated only for HPL I think). They’re both winning tech, and hopefully priced so that we (users, who are very nice people) also win!
Yes. I think you’re right. It will come down to money. And which one is more general purpose. And which one you can buy.
“(…)resulting machines have about the same thermals at about the same cost for the same performance(…)”
In theory it’s possible, but it won’t happen. Practically impossible 🙂 The temperatures will be different. And the better the thermals, the higher the price.