Nvidia Embraces The CPU World With “Grace” Arm Server Chip

Within a year or so, with the launch of the “Grace” Arm server CPUs, it will not be heresy for anyone at Nvidia to believe, or to say out loud, that not every workload in the datacenter needs to have GPU acceleration.

In a way, with the adoption of the BlueField line of Arm-based DPU processors, this transformation is already happening with Nvidia's system architecture. But with the release of the Grace CPUs, which were previewed at the GTC 2021 conference a year ago and which are due sometime in the first half of 2023 if all goes well, Nvidia will immediately be a credible supplier of Arm server chips that can compete in terms of performance per watt and, we presume, cost per watt, with the best X86, Arm, or Power architectures out there at the same time. It will be an important event for Nvidia to shift from being a GPU accelerator supplier to a host CPU supplier – and host CPUs that have plenty of their own vector math oomph and that – and this is the very important thing – will be able to run the entire Nvidia HPC and AI stack, including compilers, libraries, and other systems software. The Grace CPU will be a full peer to the Hopper GPU, something that the COBOL-inventing former US Navy Rear Admiral would perhaps approve of in a metaphorical sense.

We have been advocating for over a decade for Nvidia to bring an Arm server chip to market, and were excited by the possibilities when Jensen Huang, Nvidia co-founder and chief executive officer, announced the "Project Denver" Arm server initiative back in January 2011, which was when the first wave of Arm server chips tried to crash through the datacenter doors. In 2014, when two of the Denver Arm cores appeared on the Tegra K1 "superchip" hybrid CPU-GPU chip, the word on the street was that Nvidia had come up with its own CPU instruction set architecture (ISA) and was emulating the Arm ISA on top of it and, significantly, was able to emulate the X86 ISA as well. (Transmeta tried to do something like this two decades ago, remember?) Imagine the lawsuits if Nvidia had launched a full-on Denver server chip that could emulate a Xeon or Opteron (and now Epyc) and also run Arm workloads and, perhaps, its own native mode. . . .

But alas, we had to wait another dozen years, and for the Nvidia deal to acquire Arm Holdings from SoftBank for $40 billion to fall through, for a much cleaner Arm server chip plan to emerge for Nvidia. And we think it is the plan that Nvidia originally had before the SoftBank deal was proposed – we joked with Huang that we wanted an Arm server chip from Nvidia, but that Nvidia didn't have to Victor Kiam it and buy the whole company.

That said, we understood the once-in-a-lifetime opportunity to buy all of Arm – for about an arm and a leg, financially speaking, but it was mostly Nvidia stock, which spends like cash but isn't actually cash. And we also fully understood the implications of the future Armv9 architecture and the fact that a lot of machine learning work – certainly most inference and possibly some training – would remain on CPUs and would not migrate to GPUs or other accelerators, for that matter. That fact, as we said when we were poring over the Armv9 announcement in March 2021, just a few weeks after the Grace effort was revealed and eight months after the Arm Holdings deal was announced, was why Nvidia wanted to buy Arm: It could get the licensing money for vector, matrix, and digital signal processing intellectual property that would be added to all kinds of CPUs precisely because system architects did not want to do a GPU offload.

There are coding and security implications to doing offloads of any kind – encryption accelerators, bump-in-the-wire FPGA accelerators, or GPU accelerators – and many enterprises and organizations do not want to deal with them. Those who need 10X or 100X better AI performance and 10X better HPC performance have little choice but to use GPUs – unless they want to make a custom CPU with a lot of cores and lots of vector engines. Fujitsu has done this with the A64FX Arm CPU for the "Fugaku" supercomputer at RIKEN lab in Japan, and the National Research Center of Parallel Computer Engineering and Technology has done this with the Sunway SW26010 and SW26010-Pro processors for the respective "TaihuLight" and "OceanLight" supercomputers at the National Supercomputing Center in Wuxi. And with impressive results, we might add. But high price/performance and low power draw are not the hallmarks of either of these machines. (Fugaku topped the Green500 supercomputer rankings three years ago, but has been pushed down the list by a long line of machines accelerated by Nvidia "Ampere" A100 GPU accelerators. The "Hopper" H100 GPU accelerators will only make these comparisons worse, and as far as we know there is no A64FX-2 chip in the works with a process shrink, a clock speed increase, a wattage reduction, or a price/performance improvement over the A64FX.)

Despite all of that, many enterprises and organizations will still face a choice: pay untold millions of dollars to rip their C, C++, and Fortran codes apart to do GPU offload, or pay more for electricity and take longer to come up with answers by running AI workloads on a zippy CPU that can do matrix and vector math reasonably well – one whose memory subsystem is bandwidth challenged but has gobs and gobs of capacity compared to the skinny but sprinter-fast HBM memory of the GPU accelerator.

That is one reason why the Grace CPU is so important for Nvidia, and so is the declaration that Grace will run all of the software that Nvidia has created to run on GPUs.

Let’s get that in writing, straight from the top in Huang’s GTC 2022 keynote: “Grace will be amazing at AI, data analytics, scientific computing, and hyperscale computing and Grace will be welcomed by all of Nvidia’s software platforms – Nvidia RTX, HPC, Nvidia AI, and Omniverse.”

Paresh Kharya, senior director of accelerated computing at Nvidia, was a little more explicit about this, because software being welcomed on Grace is not the same thing as software running on Grace: "We are executing on plan with our own CPU roadmap, and the Grace CPU will also run all of Nvidia's computing stack, including Nvidia RTX, HPC, Nvidia AI, and Omniverse, and this is a continuation of our over a decade long journey with Arm CPUs in our products, including a key milestone three years ago when we announced bringing CUDA and our full stack of HPC and AI software to Arm."

So that's the first important new thing we have learned about the Grace CPU. It can stand alone if customers want it that way, and it can do any kind of computation that a GPU can as far as Nvidia is concerned.

The second important thing is that hybrid CPU-GPU systems running either HPC or AI workloads need host processors, and they need tighter coupling between the CPUs and the GPUs and coherent memory that is based on as similar a set of technologies as possible in these two devices – something that, to date, has not been done. Big Blue demonstrated with the Power9 chip, which had a pair of NVLink ports on it, that it could share memory between a network of Nvidia V100 GPUs with HBM2 memory and the DRAM on the Power9 chip relatively seamlessly. But the bandwidth was not all that balanced. The CPU has relatively slow access to its own memory, and it becomes a kind of DRAM controller for the GPU clusters in a machine, which can speak a hell of a lot faster to each other. The difference is 64 GB/sec of CPU memory bandwidth passing into the GPUs versus 8,000 GB/sec of bandwidth when the GPUs are talking to each other. (This was a hypothetical comparison using HBM2e memory running at 2 TB/sec per card, not the HBM3 memory in the top-end Hopper H100 package that runs at 3 TB/sec.)
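If you want to do that back-of-the-envelope math yourself, here is a quick sketch in Python, with the assumption – ours, not Nvidia's or IBM's – that the 8,000 GB/sec figure is the aggregate of four hypothetical 2 TB/sec HBM2e cards:

```python
# Hypothetical bandwidth figures cited above (GB/sec), not measured numbers
cpu_to_gpu_bw_gbs = 64      # Power9 DRAM bandwidth feeding the GPUs over NVLink
gpu_to_gpu_bw_gbs = 8_000   # assumed: four hypothetical 2 TB/sec HBM2e cards talking to each other

# The GPUs talk to each other 125X faster than the CPU can feed them
print(f"GPU-to-GPU is {gpu_to_gpu_bw_gbs / cpu_to_gpu_bw_gbs:.0f}X the CPU feed")
```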

With the combination of the Grace-Hopper hybrid chips, as we showed a year ago, the Grace CPUs link to their low power DDR5 memories (which are mounted on-package like the HBM memory on a GPU accelerator) with NVLink ports adding up to 500 GB/sec of aggregate bandwidth, and there are 500 GB/sec NVLink ports between collections of Grace CPUs so they can share data, as well as 500 GB/sec links between a Grace CPU and a Hopper GPU. This is what Nvidia is now calling NVLink Chip to Chip, or NVLink C2C for short, which it is making available for license for the first time so other chips can be equipped with it. And once again, as we said a year ago: With this architecture, Nvidia might be creating NVLink memory, with NVLink SerDes being linked directly to some kind of buffered LPDDR5 memory, much as IBM is using its own signaling SerDes in the Power10 chip as NUMA, I/O, and memory links with buffered DDR5 memory.

It is not clear if Nvidia will sell the Grace CPU as a standalone product for hyperscalers, cloud builders, OEMs, or ODMs to create their own systems. At this point, we know there are two different variations of the Grace that are coming to market in what looks like a variant of the SXM5 form factor used in the high-end Hopper GPU accelerator:

On the left in the chart above is the Grace-Hopper module, combining the CPU and the GPU into a single package tightly coupled by NVLink, and on the right are a pair of Grace CPUs, each with 72 cores, 512 GB of main memory and 500 GB/sec of main memory bandwidth.

If you look at the Grace die shots very carefully below. . .

. . . then you will see that each Grace die has four quadrants of cores. Two of the quadrants have 18 cores each and two of the quadrants have 24 cores each, which is a weird ratio but there it is. That works out to 84 cores per die, with what looks like an easy expansion to 96 cores with the addition of another row of 12 cores across two quadrants. The dies are rotated 180 degrees from each other in the mock-up above, and that might be significant for balance reasons across the package.

Each of the Grace dies has eight banks of LPDDR5X memory, which we presume comes from Samsung, and if that is the case, it runs at 4.23 GHz and appears to be delivering 62.5 GB/sec of bandwidth per channel across eight memory channels, or 500 GB/sec per die. The LPDDR5X memory that Nvidia is mounting on the Grace package has ECC error detection and correction scrubbing on it, which is a must for server workloads. So across that Grace-Grace pair, there is 1 TB of memory and an aggregate 1 TB/sec of memory bandwidth between the CPU dies and their main memories. (There is a single 900 GB/sec NVLink port between the two Grace dies as far as we know.) There is also 396 MB of L3 cache memory across the two Grace chips, which is 198 MB per Grace chip and 2.75 MB per core. That is if the yield on the cache is 100 percent. If the yield on the cache is not 100 percent – and the yield on the cores is not, with only 72 of the 84 cores active – then there is actually 231 MB of physical L3 cache on each Grace die.
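For those who want to check our arithmetic, here is a quick sketch; the per-channel bandwidth and the 84-core physical die are our inferences, not confirmed Nvidia specs:

```python
# Sanity check on the Grace memory and cache arithmetic above
# (per-channel bandwidth and 84-core physical die count are our inferences)
channels_per_die = 8
bw_per_channel_gbs = 62.5
print(channels_per_die * bw_per_channel_gbs)        # 500 GB/sec per Grace die
print(2 * channels_per_die * bw_per_channel_gbs)    # 1,000 GB/sec across a Grace-Grace pair

l3_per_pair_mb = 396
l3_per_die_mb = l3_per_pair_mb / 2                  # 198 MB per die
l3_per_core_mb = l3_per_die_mb / 72                 # 2.75 MB per active core
print(l3_per_core_mb * 84)                          # 231 MB of physical L3 if all 84 cores carry cache
```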

It is not clear what the cores are within the Grace CPU, but we know for sure that they implement the Armv9 instruction set and, we think, Grace will be one of the first CPUs on the market to do so. (We expect an Armv9 Graviton4 to come out in November of this year if Amazon Web Services keeps to its annual cadence of homegrown CPU announcements, with shipments into its cloud starting in early 2023.) But it seems very unlikely that the "Poseidon" platforms and their N3 and V2 cores (those are our designations for the successors to the "Zeus" V1 Armv8.4 cores used in the Graviton3 and the "Perseus" N2 Armv9 cores no one has shipped as yet) will be ready to be plunked down into the Grace die. But there is always an outside chance that Nvidia will create a custom Armv9 core that has two 256-bit wide SVE2 vectors and also uses other Armv9 features as well. Nvidia does not have to wait for Arm to put Poseidon cores into the field, after all, and it could be doing all kinds of custom ISA stuff, too, as it was doing in Project Denver so many years ago.

Don’t assume they are just going to be the Perseus N2 cores is what we are saying. And going forward, even if the Grace 1 chip does use the N2 cores, do not assume that Grace 2 will not be a custom core. Nvidia is very big on customization. We are reasonably certain that Grace will be implemented in a 5 nanometer process by Taiwan Semiconductor Manufacturing Corp – slightly fatter than the custom 4N process that Nvidia used on the Hopper GPU – but also don’t be surprised if Grace is also implemented in the custom 4N process to shrink the die and goose the performance.

With the Grace-Grace double-whammy module, Nvidia is projecting that the chip will deliver a rating of over 740 on the SPECrate2017_int_base integer benchmark. It is tough to guess where the clock speed might be on the Grace unit, but with only a 500 watt power draw for two CPUs (including the memory), we expect it to be somewhere around 2 GHz, maybe as high as 2.3 GHz. If that is the case, then two 128-bit SVE2 FMA vector units per core can do 8 FP64 flops per clock per core, which across the 144 cores in the module works out to 2.3 teraflops at FP64 double precision running at 2 GHz and 2.65 teraflops at 2.3 GHz. That ain't huge, mind you. But it is competitive with many other CPUs, particularly those aimed at hyperscalers. That said, we think there is a very good chance that Nvidia will want to have a pair of 256-bit SVE2 vectors in its Grace core to double the floating point performance. That would put it on par with Graviton3 from AWS, which is using the "Zeus" V1 core aimed at HPC and AI workloads.
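Here is that napkin math spelled out, with the 128-bit vector width and the clock speeds being our assumptions rather than anything Nvidia has confirmed:

```python
# Peak FP64 estimate for a 144-core Grace-Grace module
# (vector width and clock speeds are our assumptions, not Nvidia figures)
cores = 144                              # two 72-core Grace dies per module
flops_per_clock_per_core = 2 * 2 * 2     # 2 SVE2 units x 2 FP64 lanes per 128-bit unit x 2 flops per FMA

for ghz in (2.0, 2.3):
    tflops = cores * flops_per_clock_per_core * ghz * 1e9 / 1e12
    print(f"{ghz:.1f} GHz: {tflops:.2f} teraflops FP64")   # 2.30 and 2.65 teraflops
```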

We shall see.

That brings us to the Grace-Hopper hybrid CPU-GPU module:

This is basically a complete accelerated compute unit on a card. It doesn't need anything else but flash storage for its software and scratchpad and a link to the outside world. NVLink ports running at 900 GB/sec are there by default. It is not clear which Hopper chip will be used in the Grace-Hopper module, but we strongly suspect it will be a geared-down version of the GPU, as is used in the PCI-Express 5.0 version of the Hopper H100 GPU accelerator, which has 2 TB/sec of bandwidth on its 80 GB of HBM3 memory. This matches the outline that Nvidia set out for Grace a year ago, and only burns 350 watts for the GPU and its HBM3 memory stacks. That implies that the Grace-Hopper package will burn about 600 watts and have 592 GB of total memory – just shy of the 600 GB shown in the chart above, but Nvidia is rounding up.
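Here is how we get to those figures, with the 250 watt allocation for the Grace side being our assumption based on half of a 500 watt Grace-Grace module:

```python
# Estimated Grace-Hopper module totals (the CPU power split is our assumption)
grace_watts = 250        # assumed: half of a 500 watt Grace-Grace module, memory included
hopper_watts = 350       # geared-down Hopper with HBM3, per the discussion above
print(grace_watts + hopper_watts)   # roughly 600 watts for the package

print(512 + 80)          # 512 GB LPDDR5X + 80 GB HBM3 = 592 GB, rounded up to 600 GB in the chart
```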

The thing to remember, and which Huang showed during his keynote, is that the ratios of Grace CPUs and Hopper GPUs are not static. That would be extremely limiting when it comes to system architecture because not all workloads have the same ratios of CPUs to GPUs. Here are some of the possibilities that Huang showed off:

You will probably have to click on that image above to see it well.

On the left is a Grace module with a 400 Gb/sec ConnectX-7 adapter, and it is safe to assume that each module of computation will need its own network interface if it is sharing data across the system. NVLink will be used to lash these components together within a node, and it would be interesting to see Nvidia come up with composability software running atop NVSwitch inside the box and NVLink Switch across the boxes to make racks of CPU and GPU modules composable. (We are going to have a think on this.)

The interesting one in the chart above shows a freestanding Grace CPU with 512 GB of memory attaching to two independent Hopper GPUs in the SXM5 form factor. This looks like a Mini-ITX-style board. After that it is just combinations of one Grace-Grace module with two, four, or eight SXM5 versions of the Hopper GPU. We presume that each pair of GPUs will need an NVSwitch 3 ASIC to link the CPUs to the GPUs, and the links between the Grace-Grace module and the GPUs might also need another NVSwitch ASIC. (We discussed the new NVSwitch and NVLink Switch devices and topologies in this story.) It is not clear, but we are going to find out and do a follow up.

Coming up next, we will do competitive analysis on the Hopper GPUs – both raw feeds and speeds and benchmarked performance – and also the deep dive on the Hopper architecture.
