If large language models are the foundation of a new programming model, as Nvidia and many others believe they are, then the hybrid CPU-GPU compute engine is the new general purpose computing platform.
After years and years of having others build these hybrid machines based on its high-end SXM streaming multiprocessor modules or PCI-Express cards, Nvidia is taking the AI infrastructure bull by the horns with its latest GH200 superchips and architecting a complete system suitable for running the largest AI training and inference workloads. The GH200 combines a “Grace” CPU based on the Arm architecture with a “Hopper” GPU based on the sixth true generation of GPU compute engines, both of which come from Nvidia.
While Nvidia has been shipping its Grace-Hopper superchips in volume since May and has standardized server designs based on HGX form factors for Hopper-only GPU boards and MGX form factors that are used for a mix of Grace and/or Hopper chips, the one you will probably want to get your hands on is the second generation GH200 superchip, which was recently announced at the Siggraph 2023 conference while The Next Platform was on vacation. (We are playing catch up in our analysis.)
With this second generation GH200 superchip, the Grace CPU and the Hopper GPU silicon are exactly the same as in the first generation, with the Grace chip being based on the “Demeter” Neoverse V2 cores from Arm Ltd, which Nvidia tried to buy for $40 billion three years ago and which has just filed to go public – again. The original Hopper SXM5 GPU compute engine had 80 GB of HBM3 memory with 3.35 TB/sec of memory bandwidth. Last year, when the Hopper GPUs were launched, this SXM5 device had six stacks of HBM3 memory, but only five of them were active (we suspect for yield reasons), so it only had 80 GB of capacity instead of the 96 GB you would expect. The PCI-Express version of Hopper had the same five out of six HBM3 memory stacks working and it only delivered 2 TB/sec of memory bandwidth (due presumably to a lower clock speed that burned a lot less juice and created a lot less heat).
When the Grace-Hopper SXM5 superchip – formally known as the GH200 – was launched, Nvidia was able to fire up all six stacks and get 96 GB of memory and 4 TB/sec of memory bandwidth out of the Hopper GPU. With the second generation GH200 superchip, Nvidia is moving the Hopper part of the compute complex to HBM3e memory and it is able to boost the capacity to 141 GB and the bandwidth to 5 TB/sec. That is a 76.3 percent increase in memory capacity and a 49.3 percent increase in memory bandwidth compared to the original Hopper SXM5 device announced last year.
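As a sanity check on those percentages, here is a quick sketch using the published specs quoted above (the figures are Nvidia's; the variable names are our own shorthand):

```python
# Published Hopper GPU memory specs (from Nvidia's announcements).
h100_sxm5 = {"hbm_gb": 80, "bw_tbs": 3.35}   # original Hopper SXM5, HBM3
gh200_v2 = {"hbm_gb": 141, "bw_tbs": 5.0}    # second-gen GH200, HBM3e

# Fractional gains of the second-gen GH200 over the original SXM5 device.
cap_gain = gh200_v2["hbm_gb"] / h100_sxm5["hbm_gb"] - 1
bw_gain = gh200_v2["bw_tbs"] / h100_sxm5["bw_tbs"] - 1

print(f"Capacity increase:  +{cap_gain * 100:.1f} percent")
print(f"Bandwidth increase: +{bw_gain * 100:.1f} percent")
```

The output lands on the 76.3 percent and 49.3 percent figures cited above.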
Ian Buck, general manager of hyperscale and HPC at Nvidia, tells The Next Platform that the memory upgrade is enabled by the shift from HBM3 to HBM3e, and that the Hopper GPU was designed from the get-go to support the faster and denser HBM3e memory so it could intersect the memory technology transitions and take advantage of them. We strongly suspect that this capacity and bandwidth boost does not come from an increase in the number of memory chips in each HBM stack, and unless the Hopper SXM5 package was completely redesigned, there are still no more than six stacks, either.
Buck would not confirm who the HBM3e memory supplier was for the second-gen GH200 superchip, but Samsung, SK Hynix, and Micron Technology all have HBM3e memory in the works and any of them could, in theory, be the supplier since this revamped superchip is not sampling until the end of the year and is not shipping until the second quarter next year. Buck did say that Nvidia has multiple suppliers for HBM memory, and we think this is wise given the high cost and difficulty of manufacturing this stuff compared to normal DRAM. For all we know, to boost manufacturing output, all three vendors are supplying HBM3e memory to Nvidia for its GPU engines.
When we asked when the regular Hopper SXM5 and Hopper PCI-Express cards might see an HBM3e upgrade, Buck was mum on the subject. And when we suggested that perhaps the LPDDR5 memory in the Grace CPU could get an upgrade, too, to increase its memory capacity and possibly its bandwidth, Buck gave us the standard line about not talking about unannounced products.
Nvidia has not given pricing on any of the Hopper or Grace-Hopper compute engines, and is similarly not talking about whether this increased memory capacity and memory performance is being given away for free. (We strongly suspect not.) Buck did say that Nvidia expects that system builders buying Grace-Hopper superchips will “fairly quickly” move to this second generation.
For HPC and AI applications that are bound by memory capacity and memory bandwidth, the HBM3e memory will be a boon for performance.
“I expect for bandwidth limited applications to get near that 1.5X increase,” says Buck about the next-gen GH200. “We won’t always be able to push that level, of course, but it is going to be in that ballpark. What’s interesting also is the capacity increase, because you can fit a larger model on a single GPU, and now with the CPU-GPU combined you effectively have almost 700 gigabytes of combined memory, so you can do more with a single GPU. It will have increased performance, but you won’t necessarily require two GPUs to run a larger model.”
That’s the first announcement from Siggraph. The second one is that Nvidia has come up with a two-socket Grace-Hopper superchip that has 900 GB/sec direct NVLink ports connecting them together into a shared memory complex with two Grace CPUs and two Hopper GPUs. In fact, it is a four-way link between the devices, just like we see in four-way CPU systems, so each device can reach out and talk to the memory of any of the other devices in the complex. Call it asymmetrical NUMA, if you will.
“It basically changes these two processors – there is one giant GPU, one giant CPU – into a super-sized superchip,” Jensen Huang, Nvidia co-founder and chief executive officer, explained in his keynote at Siggraph. “The CPU now has 144 cores, the GPU has 10 TB/sec of frame buffer bandwidth and 282 GB of HBM3e memory. Well, pretty much you can take any large language model you like and put it on this and it will inference like crazy. And the inference cost of large language models will drop significantly because look at how small this computer is. And you can scale this out in the world’s datacenters – you can connect it with Ethernet, you can connect it with InfiniBand.”
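Huang's numbers for the dual complex are just the single-superchip specs doubled, which makes them easy to sanity check (the 72-core count per Grace CPU is Nvidia's published spec):

```python
# One second-gen GH200 superchip (published specs).
single = {"cpu_cores": 72, "hbm_gb": 141, "bw_tbs": 5.0}

# The dual complex NVLinks two superchips into one shared memory domain,
# so the headline numbers are simply double the single-chip figures.
dual = {key: 2 * value for key, value in single.items()}

print(dual)  # {'cpu_cores': 144, 'hbm_gb': 282, 'bw_tbs': 10.0}
```

That reproduces the 144 cores, 282 GB of HBM3e, and 10 TB/sec of frame buffer bandwidth from the keynote.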
That bit about lowering the cost of inference is key, because when you have to do inference on the same class of machine you do training on, it is very expensive unless you can lower the cost of the training machine to where you want the inference cost to be. It remains to be seen how much cheaper the Grace-Hopper approach will be compared to building machines that look like the DGX H100 servers that Nvidia makes, which are based on eight-way GPU complexes using Hopper SXM5 units, all interlinked with NVSwitch fabrics.
It really comes down to how much less expensive a Grace CPU is compared to buying an X86 host based on Intel or AMD processors. The latter will be more expensive, we think, but will also allow for a much larger CPU memory space and flash storage footprint, as the DGX H100 servers and their HGX clones do. We like the idea of making MGX Grace-Hopper clusters using NVSwitch fabrics to interconnect up to 256 GPUs and then using InfiniBand to cross couple multiple pods together into a superpod. And it would be fascinating to see how a Grace-Hopper superpod with 256 GPUs performs against a Hopper-based DGX H100 superpod that uses Intel “Sapphire Rapids” Xeon SPs on the CPU hosts, has the same 256 Hopper GPUs, and NVSwitch inside of the fatter nodes. Given the higher bandwidth and memory capacity on the second-gen Grace-Hopper GH200s compared to the GH100 SXM5, it is not hard to figure out which would win.
And if and when the GH100 SXM5 is moved to HBM3e stacks with 141 GB and 5 TB/sec capacity, we suppose it will all come down to the nature of the AI training and inference workloads and how they react to a hierarchy of memory and networking.