Updated: We have obtained new information in the wake of publishing our story.
Ever since the launch of the Graviton4 processor for two-socket systems two years ago, we have been expecting a new Arm server CPU design out of the Annapurna Labs folks, who create the CPUs, XPUs, DPUs, and scale-up switches for Amazon Web Services.
The Graviton4, based on the “Demeter” V2 core just like the Nvidia “Grace” CG100 processor, was the first server CPU created by AWS that had NUMA clustering, allowing two CPUs to share memory and present a single memory space and compute complex to the operating system. But as Dave Brown, vice president of compute and machine learning services at AWS, explained in the opening keynote at the re:Invent 2025 conference today, having two processors share memory across a set of NUMA links introduced a lot of latency for applications, as did not having enough L3 cache for the Graviton4 cores to use as a DRAM cache. These factors, among others, made applications run slower than you might expect given the total of 192 Neoverse V2 cores supporting them.
So with the Graviton5, which is now in technology preview with selected AWS customers, the Annapurna Labs team seems to have scrapped the NUMA approach and put 192 Arm cores on a single socket. And now the bottleneck shifts back to the balance of memory capacity and memory bandwidth reckoned against those 192 cores, because now there is half as much memory capacity and perhaps a little more than half as much memory bandwidth for what we presume are “Poseidon” Neoverse V3 cores inside that single Graviton5 socket. (See Arm Neoverse Roadmap Brings CPU Designs, But No Big Fat GPU for more on the Neoverse core and chip roadmaps from Arm.)
There is nothing, we think, that precludes AWS from creating a two-socket NUMA version of Graviton5, of course, and that may eventually happen if customers need such configurations. (And we think that some will.)
Brown did not give out much in the way of feeds and speeds for Graviton5. We know Graviton5 has 192 cores in a single socket, 2X the cores of the Graviton4 CPU, but only delivers about 25 percent more performance than a pair of Graviton4s, as we explain below. We also know that Graviton5 has 2.67X the L3 cache per core of Graviton4 and 5.3X the L3 cache per chip. We think that Graviton5 is etched in the same 3 nanometer process from Taiwan Semiconductor Manufacturing Co as the current Trainium3 XPU that is now shipping in volume inside UltraServer clusters.
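Those two cache ratios are consistent with each other, for what it is worth. Here is a quick sanity check in Python, assuming (as we do) that Graviton5 doubles the core count from 96 to 192; the 5.3X per-chip figure is just the 2.67X per-core figure times the 2X core count:

```python
# Sanity check of the L3 cache ratios AWS cited. The 96 and 192 core
# counts are our assumption about Graviton4 and Graviton5, not an AWS
# disclosure.
g4_cores, g5_cores = 96, 192

l3_per_core_ratio = 2.67              # Graviton5 vs Graviton4, per AWS
core_ratio = g5_cores / g4_cores      # 2X the cores

# Per-chip ratio = per-core ratio x core count ratio
l3_per_chip_ratio = l3_per_core_ratio * core_ratio
print(f"Implied per-chip L3 ratio: {l3_per_chip_ratio:.2f}X")  # ~5.34X, matching the 5.3X claim
```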
Brown also briefly showed a block diagram of Graviton5, which we snapped quickly but which is still blurry given how far away the camera was from the backdrop screen on stage:
If you sort of squint at that, you can see 96 pairs of Arm cores in the center of the chip, with a mesh interconnect between them. There are four PCI-Express 6.0 controllers across the top of the chip and four more across the bottom, which should be 96 lanes in total at twelve lanes per PCI-Express controller, or about 1.54 TB/sec of aggregate full duplex bandwidth.
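To show our math on that interconnect figure, here is a back-of-the-envelope sketch; the eight controllers at twelve lanes each are our read of the blurry block diagram, not a confirmed spec, and we are using raw PCI-Express 6.0 signaling rates before protocol overhead:

```python
# Rough PCI-Express 6.0 bandwidth tally for Graviton5, based on our read
# of the block diagram (eight controllers at twelve lanes each).
controllers = 8
lanes_per_controller = 12
lanes = controllers * lanes_per_controller   # 96 lanes in total

gts_per_lane = 64                            # PCIe 6.0 signals at 64 GT/sec per lane
gbytes_per_lane = gts_per_lane / 8           # ~8 GB/sec per lane, per direction, raw

one_way = lanes * gbytes_per_lane            # 768 GB/sec in each direction
print(f"{lanes} lanes, {one_way:.0f} GB/sec each way, "
      f"{2 * one_way / 1000:.2f} TB/sec full duplex")   # ~1.54 TB/sec
```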
On the right and left edges of the chip you see six DDR5 memory controllers on each side, for a total of twelve DDR5 memory controllers for the whole Graviton5 socket. If AWS used DDR5-6400 memory running at 6.4 GT/sec, a single Graviton5 chip would have 614.4 GB/sec of memory bandwidth, a 14.3 percent increase compared to Graviton4. That doesn’t seem like a lot, and as we had hoped, AWS is in fact using DDR5-7200 memory with Graviton5, which delivers 691.2 GB/sec of bandwidth in the socket, a 28.6 percent increase over the 537.6 GB/sec of the Graviton4. However, two Graviton4s had twice the memory capacity and 55.6 percent more bandwidth than a single Graviton5, so some things are given up when moving 192 cores back to a single socket.
When we were fantasizing about what AWS might do with Graviton5’s main memory, we had hoped it would push up to 16 controllers on the socket, which would have delivered 819.2 GB/sec at 6.4 GT/sec. Instead, AWS is pushing up memory speeds, and the Graviton5 will support DDR5-8400 memory running at 8.4 GT/sec, which will deliver 806.4 GB/sec in a single socket, or 75 percent of the aggregate 1,075.2 GB/sec of a dual-chip Graviton4 setup.
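All of the bandwidth figures above fall out of the same simple formula: channel count times transfer rate times 8 bytes per transfer. Here is that arithmetic in Python, with the twelve-controller count being our read of the block diagram and the sixteen-controller case being our own hypothetical:

```python
# DDR5 peak bandwidth arithmetic behind the figures in the story.
def ddr5_bw_gbps(channels: int, mtps: int) -> float:
    """Peak bandwidth in GB/sec for 64-bit (8-byte) DDR5 channels."""
    return channels * mtps * 8 / 1000

configs = {
    "Graviton4, 12 x DDR5-5600":     ddr5_bw_gbps(12, 5600),      # 537.6 GB/sec
    "2 x Graviton4 NUMA pair":       2 * ddr5_bw_gbps(12, 5600),  # 1,075.2 GB/sec
    "Graviton5, 12 x DDR5-6400":     ddr5_bw_gbps(12, 6400),      # 614.4 GB/sec
    "Graviton5, 12 x DDR5-7200":     ddr5_bw_gbps(12, 7200),      # 691.2 GB/sec
    "Graviton5, 12 x DDR5-8400":     ddr5_bw_gbps(12, 8400),      # 806.4 GB/sec
    "Hypothetical 16 x DDR5-6400":   ddr5_bw_gbps(16, 6400),      # 819.2 GB/sec
}
for name, bw in configs.items():
    print(f"{name}: {bw:,.1f} GB/sec")
```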
Brown did not speak about the Graviton5 core at all, but we have since confirmed that the core is based on the Poseidon Neoverse V3 core, which implements the Armv9.2-A enhancements. Because Brown said that the Graviton5 core delivered 25 percent more oomph than the Graviton4 core, we presumed it was a massively geared-down 192-core chip with a mere 1.75 GHz clock speed. But, as it turns out, AWS was talking about a two-socket Graviton4 machine compared to a one-socket Graviton5 machine, and it is now clear that the NUMA Graviton4 implementation was a stopgap maneuver until the Graviton5 chip could come to market.
The Poseidon V3 core allows 2 MB or 3 MB of L2 cache per core, and we opted for the fatter one in our table; it turns out to be 2 MB in actuality. We think the L1 instruction and data caches will stay at 64 KB each inside each core.
Here is how the six different Graviton chips stack up on the feeds and speeds:
When we do our estimating, we think the Graviton5 complex has around 132 billion transistors and burns about 180 watts running at our original and hypothetical 1.75 GHz, and around 650 watts running at what we presume is its actual speed of 3.1 GHz.
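That 650 watt guess is at least consistent with first-order dynamic power scaling, which goes roughly as frequency times voltage squared. Here is a crude sketch, using our own 180 watt and 1.75 GHz baseline, that brackets the estimate between a frequency-only bound and a voltage-scales-with-frequency bound; none of these are AWS numbers:

```python
# First-order bounds on how power scales with clock speed. The 180 watt
# baseline at 1.75 GHz and the 3.1 GHz target are our estimates, not AWS
# disclosures. Dynamic power scales roughly as frequency x voltage^2.
p_base_watts = 180.0
f_base_ghz, f_target_ghz = 1.75, 3.1
r = f_target_ghz / f_base_ghz

linear_watts = p_base_watts * r       # voltage held flat: ~319 watts
cubic_watts = p_base_watts * r ** 3   # voltage scaled with frequency: ~1,001 watts
print(f"Between {linear_watts:.0f} W and {cubic_watts:.0f} W")  # our 650 watt guess sits in between
```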
We envision that Graviton5 does not just have PCI-Express 6.0 controllers, but also has variations with NVLink Fusion and UALink ports to directly link into GPU and XPU compute engines to share memory.
Brown said that M9g instances using Graviton5 and aimed at general purpose workloads are in preview now. C9g instances aimed at compute-intensive jobs and R9g instances aimed at memory-intensive jobs are expected to be unveiled in 2026.