Updated: We have obtained new information in the wake of publishing our story.
We have been expecting a new Arm server CPU design out of the Annapurna Labs folks, who create the CPUs, XPUs, DPUs, and scale-up switches for Amazon Web Services, ever since the launch of the Graviton4 processor for two-socket systems two years ago.
The Graviton4, based on the same “Demeter” V2 core as the Nvidia “Grace” CG100 processor, was the first server CPU created by AWS with NUMA clustering, allowing two CPUs to share memory and present a single memory space and compute complex to the operating system. But as Dave Brown, vice president of compute and machine learning services at AWS, explained in the opening keynote at the re:Invent 2025 conference today, having two processors share memory across a set of NUMA links introduced a lot of latency for applications, as did not having enough L3 cache for the Graviton4 cores to use as a DRAM cache. These and other factors made applications run slower than you might expect given a total of 192 Neoverse V2 cores to support them.
So with the Graviton5, which is now in technology preview with selected AWS customers, the Annapurna Labs team seems to have scrapped the NUMA approach and put 192 Arm cores on a single socket. And now the bottleneck shifts back to the balance of memory capacity and memory bandwidth reckoned against those 192 cores, because now there is half as much memory capacity and perhaps a little more than half as much memory bandwidth for what we presume are “Poseidon” Neoverse V3 cores inside that single Graviton5 socket. (See Arm Neoverse Roadmap Brings CPU Designs, But No Big Fat GPU for more on the Neoverse core and chip roadmaps from Arm.)
There is nothing, we think, that precludes AWS from creating a two-socket NUMA version of Graviton5, of course, and that may eventually happen if customers need such configurations. (And we think that some will.)
Brown did not give out much in the way of feeds and speeds for Graviton5. We know Graviton5 has 192 cores in a single socket, 2X the cores of the Graviton4 CPU, but only delivers about 25 percent more performance. We also know that Graviton5 has 2.67X the L3 cache per core of Graviton4 and 5.3X the L3 cache per chip. We think that Graviton5 is etched in the same 3 nanometer process from Taiwan Semiconductor Manufacturing Co as the current Trainium3 XPU that is now shipping in volume inside UltraServer clusters.
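Those two cache ratios are consistent with each other: doubling the core count while growing the per-core L3 allotment multiplies out to the per-chip figure. A quick sanity check in Python:

```python
# Graviton4 has 96 cores per chip and Graviton5 has 192. If Graviton5
# carries 2.67X the L3 cache per core, the per-chip L3 ratio should be
# the product of the per-core ratio and the core-count ratio.
per_core_l3_ratio = 2.67
core_count_ratio = 192 / 96
per_chip_l3_ratio = per_core_l3_ratio * core_count_ratio

print(per_chip_l3_ratio)  # ~5.3X, matching the stated per-chip figure
```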
Brown also briefly showed a block diagram of Graviton5, which we snapped quickly but which is still blurry given how far away the camera was from the backdrop screen on stage:
If you sort of squint at that, you can see 96 pairs of Arm cores in the center of the chip, with a mesh interconnect between them. There are four PCI-Express 6.0 controllers across the top of the chip and four more across the bottom, which should be 96 lanes in total at twelve lanes per PCI-Express controller, good for 768 GB/sec in each direction, or about 1.5 TB/sec in full duplex mode.
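The lane math works out like this, assuming the standard PCI-Express 6.0 signaling rate of 64 GT/sec per lane, which moves roughly 8 GB/sec per lane in each direction before protocol overhead. This is a back-of-the-envelope sketch, not an AWS-confirmed figure:

```python
def pcie6_bandwidth_gbs(lanes, per_lane_gbs=8.0):
    """Raw PCI-Express 6.0 bandwidth per direction, in GB/sec.

    PCIe 6.0 signals at 64 GT/sec per lane with PAM4 encoding, which
    works out to roughly 8 GB/sec per lane per direction before FLIT
    and protocol overheads shave a few percent off.
    """
    return lanes * per_lane_gbs

lanes = 8 * 12  # eight controllers at twelve lanes each
each_way = pcie6_bandwidth_gbs(lanes)  # 768 GB/sec per direction
both_ways = 2 * each_way               # 1,536 GB/sec full duplex

print(lanes, each_way, both_ways)
```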
On the right and left edges of the chip you see six DDR5 memory controllers on each side, for a total of twelve DDR5 memory controllers for the whole Graviton5 socket. If AWS used DDR5-6400 memory running at 6.4 GT/sec, a single Graviton5 chip would have 614.4 GB/sec of memory bandwidth, a 14.3 percent increase compared to Graviton4. That doesn’t seem like a lot, and as we had hoped, AWS is in fact using DDR5-7200 memory with Graviton5, which delivers 691.2 GB/sec of bandwidth in the socket, a 28.6 percent increase compared to the 537.6 GB/sec of the Graviton4. However, two Graviton4s had twice the memory capacity and 55.6 percent more bandwidth than a single Graviton5, so some things are given up when moving 192 cores back to a single socket.
When we were fantasizing about what AWS might do with Graviton5’s main memory, we had hoped it would push up to 16 controllers on the socket, which would have delivered 819.2 GB/sec at 6.4 GT/sec. Instead, AWS is pushing up memory speeds, and the Graviton5 will support DDR5-8400 memory running at 8.4 GT/sec, which will deliver 806.4 GB/sec in a single socket, or 75 percent of the aggregate 1,075.2 GB/sec of a dual-chip Graviton4 setup.
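All of those bandwidth figures fall out of the same simple formula: peak bandwidth is controllers times transfer rate times the width of the interface, which we treat here as 64 bits (8 bytes) per controller. Here is that arithmetic for each configuration mentioned above:

```python
def ddr5_bandwidth_gbs(controllers, mtps, bus_bytes=8):
    # Peak DDR5 bandwidth in GB/sec: each controller drives a 64-bit
    # (8 byte) interface at the given megatransfers per second.
    return controllers * mtps * bus_bytes / 1000.0

graviton4         = ddr5_bandwidth_gbs(12, 5600)  # 537.6 GB/sec
g5_ddr5_6400      = ddr5_bandwidth_gbs(12, 6400)  # 614.4 GB/sec
g5_ddr5_7200      = ddr5_bandwidth_gbs(12, 7200)  # 691.2 GB/sec
g5_ddr5_8400      = ddr5_bandwidth_gbs(12, 8400)  # 806.4 GB/sec
hypothetical_16ch = ddr5_bandwidth_gbs(16, 6400)  # 819.2 GB/sec

print(round(g5_ddr5_7200 / graviton4 - 1, 3))  # 0.286, the 28.6 percent uplift
```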
Brown did not speak about the Graviton5 core at all, but we have since confirmed that the core is based on the Poseidon Neoverse V3 core, which implements the Armv9.2-A enhancements. Because Brown said that the Graviton5 core delivered 25 percent more oomph than the Graviton4 core, we presumed it was a massively geared down 192-core chip with a mere 1.75 GHz clock speed. But, as it turns out, AWS was talking about a two-socket Graviton4 machine compared to a one-socket Graviton5 machine, and it is now clear that the NUMA Graviton4 implementation was a stopgap maneuver until the Graviton5 chip could come to market.
The Poseidon V3 core allows 2 MB or 3 MB of L2 cache per core, and we opted for the fatter one in our table; it turns out to be 2 MB in actuality. We think the L1 instruction and data caches will stay at 64 KB each inside each core.
Here is how the six different Graviton chips stack up on the feeds and speeds:
When we do our estimating, we think the Graviton5 complex has around 132 billion transistors and burns about 180 watts running at our original and hypothetical 1.75 GHz and around 650 watts running at what we presume is its actual speed of 3.1 GHz.
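Our two power estimates bracket the usual dynamic power scaling rules of thumb. Dynamic power goes as P = C · V² · f, and because voltage has to rise with frequency, total power scales somewhere between quadratically and cubically with clock speed; the 650 watt figure implies an exponent of roughly 2.25 between our hypothetical 1.75 GHz and the presumed 3.1 GHz. A hedged sketch of that scaling, using our own model rather than anything AWS has disclosed:

```python
def scale_power(watts, freq_from_ghz, freq_to_ghz, exponent=2.25):
    # Dynamic power scales as P = C * V^2 * f; with voltage rising along
    # with frequency, total power tracks f^n for some n between 2 and 3.
    # Leakage and I/O power do not follow this curve, so treat the result
    # as a rough envelope, not a spec.
    return watts * (freq_to_ghz / freq_from_ghz) ** exponent

low  = scale_power(180, 1.75, 3.1, exponent=2.0)  # ~565 W, pure quadratic
high = scale_power(180, 1.75, 3.1, exponent=3.0)  # ~1,000 W, pure cubic
mid  = scale_power(180, 1.75, 3.1)                # ~650 W, near our estimate

print(round(low), round(mid), round(high))
```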
We envision that Graviton5 does not just have PCI-Express 6.0 controllers, but also comes in variations with NVLink Fusion and UALink ports that directly link into GPU and XPU compute engines to share memory.
Brown said that M9g instances using Graviton5 and aimed at general purpose workloads are in preview now. C9g instances aimed at compute-intensive jobs and R9g instances aimed at memory intensive jobs are expected to be unveiled in 2026.

A nice upgrade of the Graviton from Neoverse V2 to V3, including PCI-Express 6.0 that enables CXL 3.0, but if they had to reduce the clock speed this much to get there, then it is not that great a move.
It will be for AWS to clarify this, I think, but possibly the 25 percent performance uplift is the one from the V2 to V3 move, as the average of the bars in the uplift chart (from “Crypto” to “AI data analytics”) in the previous “Arm Neoverse Roadmap” TNP piece would already give a 23 percent improvement (on a per core basis). And also, the Ampere A192-32X and A192-32M can run 192 cores at 3.2 GHz with 283 and 348 watts of TDP, without melting … and 283 watts is not that far off the 240 watts of Graviton3E (the A192-26X is already 192 cores at 2.6 GHz in 211 watts).
The increases in L2 and L3 cache, the speedup of DDR5, and the reduced latency from grouping the 192 cores are nice, but it is disappointing if that came at the cost of a huge clock reduction. Hopefully we get more detailed specs from AWS on these chips soon!
It is an M8g instance compared to an M9g instance, which we did not think was the case as they were talking Graviton4 versus Graviton5. Confirmation came later than the deadline, so we jumped the gun on this one, but nonetheless thought it was an odd choice. It is fixed now.
Personally, I’d understand the AWS announcement as 25 percent higher performance at the same instance size, i.e., I’d expect an m9g.16xl to have 25 percent higher performance than an m8g.16xl, not lower. You should really check this with AWS.
Well, it is. And I have refactored the story to reflect this and other details that have come to light.