We were under the distinct impression that AMD was not going to talk much about its datacenter compute engines at the Consumer Electronics Show, having just launched its “Genoa” Epyc 9004 server CPUs in November with much fanfare.
But it is the week before Intel is set to debut – at long last – its “Sapphire Rapids” Xeon SP server CPUs, and at the last minute AMD decided that Lisa Su, its chairman and president, would use her annual CES keynote to talk about datacenter products other than the Epyc 9004s. Thus we got a little bit more information about the future (and as-yet still uncodenamed) Instinct MI300 hybrid CPU-GPU accelerators as well as a forthcoming Alveo A70 matrix math engine accelerator derived from the hard-coded AI engines in the Xilinx “Everest” Versal FPGAs.
AMD has been talking about so-called “accelerated processing unit” or APU devices for more than a decade, and has been making hybrid CPU-GPU chips for desktop, laptop, and game console machines for many years. And it has always aspired to offer a big, bad APU for datacenter compute. In fact, in the wake of the initial “Naples” Epyc 7001 CPU launch in 2017, AMD was expected to launch an APU that combined elements of this CPU and a Radeon “Vega” GPU together on a single package. For whatever reason – very likely the cost of the integration and the lack of a software platform to make it easily programmed – AMD spiked the effort and didn’t talk about datacenter APUs again except in the abstract.
But that did not mean that AMD was not working behind the scenes on an APU for the datacenter, as it most certainly was when it won the contract to supply the CPU and GPU compute engines for the future “El Capitan” exascale-class supercomputer for Lawrence Livermore National Laboratory, which will be installed later this year. In fact, Lawrence Livermore, AMD, Cray, and Hewlett Packard Enterprise all obfuscated the fact that the El Capitan machine would be using an APU and not a collection of discrete AMD CPUs and GPUs as does the “Frontier” supercomputer at Oak Ridge National Laboratory that was installed last year. Diagrams for the El Capitan machine actually show discrete devices – the same one CPU to four GPU ratio that Frontier has, in fact. No matter. We understand why the truth was bent. And it is no surprise that Intel was suddenly working on its “Falcon Shores” hybrid CPU-GPU compute engines for HPC and AI workloads and not just talking about discrete Xeon SP CPUs and discrete Max Series GPUs codenamed “Ponte Vecchio” and “Rialto Bridge”.
AMD talked about the Instinct MI300A, as the APU variant of the MI300 series is apparently going to be called, in some detail back in June last year. We think that there will be discrete, PCI-Express 5.0 variants of the MI300 series, although AMD has said nothing about this. But it is clear that Intel, AMD, and Nvidia will continue to sell discrete CPUs and GPUs as well as what AMD calls an APU and what Nvidia calls a “superchip.” The reason is simple: the ratios of CPU and GPU compute that Lawrence Livermore and other HPC/AI centers need are not necessarily going to work for all – or maybe even many – workloads. Other customers will need different ratios. When we get chiplets down to a science, the SKU stacks of all compute engine makers will vary these ratios as well as allow custom variations for a price.
Six months ago, AMD said that the MI300 would offer 8X the AI performance of the MI250X GPU accelerator used in the Frontier machine, and that could be accomplished fairly easily just by putting four of the “Aldebaran” GPU chiplets on a single package and shifting to FP4 eighth-precision floating point math. (The MI250X has two Aldebaran GPU chiplets and bottoms out at FP16 and BF16 half-precision floating point.) Or, it could mean four GPU chiplets, each with twice as many cores and supporting the FP8 format that Nvidia supports in its “Hopper” H100 GPUs and that Intel supports in its Gaudi2 accelerators.
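The arithmetic behind those two scenarios is easy to check. What follows is only a back-of-envelope sketch, and it rests on our own simplifying assumptions rather than anything AMD has said: that peak AI throughput scales linearly with the number of GPU chiplets and with the cores per chiplet, and that it doubles every time the lowest supported precision is halved. The function name and its parameters are ours, made up for illustration.

```python
# Back-of-envelope check of the two hypothetical paths to 8X AI throughput.
# Assumptions (ours, not AMD's): throughput scales linearly with chiplet count
# and cores per chiplet, and doubles each time the lowest precision is halved.

def ai_speedup(chiplets, cores_per_chiplet_ratio, lowest_precision_bits,
               base_chiplets=2, base_precision_bits=16):
    """Relative AI throughput versus an MI250X-like baseline
    (two chiplets, FP16/BF16 as the lowest precision)."""
    chiplet_gain = chiplets / base_chiplets
    precision_gain = base_precision_bits / lowest_precision_bits
    return chiplet_gain * cores_per_chiplet_ratio * precision_gain

# Scenario A: four Aldebaran-class chiplets, same cores per chiplet, FP4.
scenario_a = ai_speedup(chiplets=4, cores_per_chiplet_ratio=1.0,
                        lowest_precision_bits=4)

# Scenario B: four chiplets, twice the cores per chiplet, FP8.
scenario_b = ai_speedup(chiplets=4, cores_per_chiplet_ratio=2.0,
                        lowest_precision_bits=8)

print(scenario_a, scenario_b)  # 8.0 8.0
```

Both routes land on the same 8X multiple, which is why the figure alone does not tell us which design AMD chose.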
The big point in the MI300 is that the CPU and GPU are on a single package, using 3D packaging techniques and sharing the same HBM memory space. There is no data movement between the CPUs and the GPUs within the package – they literally share the same memory. This will apparently simplify the programming of hybrid computing, and Su said as much as she promised a “step function” increase in performance above the 2 exaflops already delivered with the CPU-GPU complexes in the Frontier system.
“To accomplish this, we have been developing the world’s first datacenter processor that combines a CPU and a GPU on a single chip,” Su said in her keynote, and clearly she meant package and not chip. “Our Instinct MI300 is the first chip that brings together a datacenter CPU, GPU, and memory into a single, integrated design. What this allows us to do is share system resources, or the memory and I/O, and it results in a significant increase in performance and efficiency, as well as it is much easier to program.”
Nvidia will ship its “Grace” Arm CPU and “Hopper” H100 GPU superchips before the MI300A ships from AMD, but the distinction is that the Grace CPU on the superchip has its own LPDDR5 main memory and the Hopper GPU on the superchip has its own HBM3 stacked memory. They have coherent memory – meaning they can move data between the devices quickly and share it over an interconnect – but the two devices are not literally using the same physical memory, so data still has to move between two blocks and types of memory. (We can debate which will be the better approach later when HPC and AI centers are coding for Grace-Hopper and MI300.)
During the keynote, Su gave out a few more details about the MI300A APU:
We thought the MI300A APU would have 64 cores, like the custom “Trento” Epyc 7003 processor used in the Frontier system, and possibly cut that down to 32 cores if the heat was getting to be too much on the device. But it turns out that the MI300A will only have 24 of the Zen 4 cores used in the Genoa Epyc 9004s. The Zen 4 cores provide 14 percent better IPC than the Zen 3 cores used in the Epyc 7003s, so 56 Zen 4 cores running at the same clock speed would have provided performance equivalent to those 64 Zen 3 cores on integer workloads. But the floating point units in the Zen 4 cores do about twice the work as those in the Zen 3 cores, so 24 Zen 4 cores will yield about the same performance on FP64 and FP32 work as 48 Zen 3 cores, depending on the clock speed of the former, of course. (That floating point bump comes from memory bandwidth increases as much as from support for AVX-512 instructions from the Intel Xeon SP architecture.)
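For those who want to check the core-count math, here is a minimal sketch. It assumes equal clock speeds and takes the 14 percent IPC uplift and the roughly 2X floating point uplift at face value. The constant and function names are ours, and real workloads will not scale this cleanly.

```python
# Rough core-equivalence arithmetic, assuming equal clock speeds.
ZEN4_IPC_UPLIFT = 1.14  # Zen 4 versus Zen 3 integer IPC, per the text
ZEN4_FP_UPLIFT = 2.0    # "about twice" the floating point work, per the text

def zen4_cores_for_integer_parity(zen3_cores):
    """Zen 4 cores needed to match a given Zen 3 core count on integer work."""
    return zen3_cores / ZEN4_IPC_UPLIFT

def zen3_equivalent_for_fp(zen4_cores):
    """Zen 3 cores' worth of FP64/FP32 throughput from a Zen 4 core count."""
    return zen4_cores * ZEN4_FP_UPLIFT

int_parity = zen4_cores_for_integer_parity(64)  # vs the 64-core Trento
fp_equiv = zen3_equivalent_for_fp(24)           # the MI300A's 24 cores

print(round(int_parity), fp_equiv)  # 56 48.0
```

Under these assumptions, the 24 Zen 4 cores deliver about 48 Zen 3 cores' worth of floating point throughput, somewhat short of the 64-core Trento chip in Frontier.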
AMD says that there are nine 5 nanometer chiplets and four 6 nanometer chiplets on the MI300A package, with HBM3 memory surrounding it. Here is what the package looks like as rendered:
And here is a very tight zoom onto the package:
That sure looks like six GPU chiplets, plus two CPU chiplets, plus an I/O die chiplet on the top, with four underlying chiplets that link two banks of HBM3 memory to the complex at eight different points and to each other. That would mean AMD re-implemented the I/O and memory die in 5 nanometer processes, rather than the 6 nanometer process used in the I/O and memory die in the Genoa Epyc 9004 complex. We strongly suspect that there is Infinity Cache implemented on those four 6 nanometer connecting chiplets, but nothing was said about that. It does not look like there is 3D V-Cache on the CPU cores in the MI300A package.
The Genoa Epyc 9004 is composed of compute complex dies (CCDs) that are etched using Taiwan Semiconductor Manufacturing Co’s 5 nanometer processes, and this looks to be the case here. Those Genoa CCDs have eight cores and 32 MB of L3 cache each.
Depending on how we interpret the MI300A rendering, there are too few or too many cores. We suspect that there are some dud cores and this complex actually has 32 physical Zen 4 cores on it, but only 24 are active. If you put a gun to our head, we would guess that there are four times as many GPU compute engines on this chip complex spread across six chiplets (rather than the rumored four chiplets and significantly higher than the two chiplets in the MI250X) with FP8 precision as the lowest precision. That gets the CDNA 3 GPU architecture to that 8X figure for AI performance cited above.
Su said that the entire MI300A complex shown above had an incredible 146 billion transistors.
Now, let’s talk about that 5X better performance per watt figure Su and others have talked about. The MI250X runs at 560 watts to deliver peak performance, and if you do the math, if the MI300A has 8X the performance and 5X better performance per watt, then that means the MI300A complex will weigh in at 900 watts. That is presumably including that 128 GB of HBM3 memory, which can run pretty hot across eight stacks.
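That math, spelled out as a quick sketch: power is performance divided by performance per watt, so the ratios simply divide. The only inputs are the figures in the text. The function name is ours, and the result is an implied estimate, not an AMD specification.

```python
# Implied package power from relative performance and performance per watt.
def implied_watts(base_watts, perf_ratio, perf_per_watt_ratio):
    """New power draw = old power * (performance gain / efficiency gain)."""
    return base_watts * perf_ratio / perf_per_watt_ratio

mi300a_watts = implied_watts(base_watts=560.0,          # MI250X peak power
                             perf_ratio=8.0,            # 8X AI performance
                             perf_per_watt_ratio=5.0)   # 5X perf per watt

print(mi300a_watts)  # 896.0
```

That 896 watts rounds to the 900 watt figure, and it would shift if either of AMD's two multiples turns out to be measured on different workloads or precisions.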
Su said that right now, it takes many months across thousands of GPUs to train AI foundation models, and many millions of dollars in electricity costs alone, and added that the MI300A devices would allow companies to do their training in weeks instead of months and save an enormous amount of time and energy – or train even larger models for the same cash.
The Instinct MI300A is back from the foundry and in the labs now, and will ship in the second half of this year. Lawrence Livermore is first in line, of course.
That leaves the Alveo A70, about which Su said very little.
The Alveo A70 is taking the DSP matrix math engines from the Versal FPGAs and plunking a lot of them down onto a new piece of silicon aimed just at doing AI inference. (The Ryzen PC chip was just augmented with these same AI matrix math engines.) This particular device plugs into a PCI-Express 5.0 slot, burns only 75 watts – a magic number for inference accelerators – and delivers 400 tera operations per second (TOPS) of AI performance, presumably at INT8 precision but it could be FP8 or even INT4. AMD did not say. What Su did say is that compared to the Nvidia T4 accelerator – which is now a generation behind given that Nvidia launched its “Lovelace” L40 accelerators in September 2022 – the Alveo A70 could outrun it by 70 percent on smart city applications, by 72 percent on patient monitoring applications, and by 80 percent on smart retail applications that had AI inference as part of the workload.
The Ryzen AI engines used in the Ryzen 7040 series PC processors weigh in at 3 TOPS each at whatever the clock speed is for these processors. If the AI engines ran at the same clock speed – probably somewhere around 3 GHz – in the Alveo A70 accelerators, it would take around 135 of them to reach 400 TOPS of aggregate performance. It is far more likely that the Alveo A70 has a slower clock speed – maybe somewhere between 1 GHz and 1.5 GHz – and so it might have somewhere between 250 and 400 of these AI engines inherited from the Xilinx FPGAs.
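Here is a small sketch of that estimate. It assumes, as we do above, that per-engine throughput scales linearly with clock speed from 3 TOPS at a guessed 3 GHz reference clock. Both that reference clock and the linear scaling are our assumptions, not AMD figures, and the function name is ours.

```python
import math

# Engine-count estimate, assuming throughput scales linearly with clock speed.
TARGET_TOPS = 400.0      # stated performance of the accelerator
TOPS_PER_ENGINE = 3.0    # per-engine figure at the reference clock
REFERENCE_GHZ = 3.0      # our guess at the Ryzen 7040 AI engine clock

def engines_needed(clock_ghz):
    """Whole AI engines required to hit TARGET_TOPS at a given clock."""
    return math.ceil(TARGET_TOPS * REFERENCE_GHZ / (TOPS_PER_ENGINE * clock_ghz))

for ghz in (3.0, 1.5, 1.0):
    print(ghz, engines_needed(ghz))
# 3.0 134
# 1.5 267
# 1.0 400
```

The exact counts matter less than the range: at plausible inference-card clocks, the implied engine count is two to three times what it would be at PC clock speeds.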