Arm Comes Full Circle With Homegrown, AI-Tuned Server CPU
It has been nearly five decades since British workstation maker Acorn Computer was founded, and nearly four decades since Advanced RISC Machines – what we have known variously as Arm, Arm Ltd, or Arm Holdings – was set up as an independent company and eventually spun out as a separate public company in 1998.
Since 2016, Japanese conglomerate SoftBank has owned most or all of Arm, depending on whether the latter had some of its shares traded publicly or not. SoftBank has aspirations as a supplier of GenAI chippery, as evidenced by its acquisition of independent Arm server CPU maker Ampere Computing for $6.5 billion in March 2025, and also as a funding source for the AI model builders. Even after Arm went public again in September 2023, SoftBank still owns around 90 percent of Arm’s shares, which have a total market capitalization (including the public float) of $164.3 billion after a 15 percent pop in the stock when Arm’s AGI CPU was launched yesterday.
That Arm has finally seen that it absolutely has to create its own CPUs – and possibly other kinds of chips used in the datacenter as well as CPUs and other chips used in edge and personal devices – is intuitively obvious. The whole model of Ampere Computing, which we have followed since its inception, was to be a second source of datacenter-class Arm server CPUs to the hyperscalers and cloud builders who were designing their chips and hiring the likes of Broadcom and Marvell to shepherd those designs through manufacturing and packaging into finished products.
According to Mohamed Awad, executive vice president of Cloud AI at Arm, customers have been asking Arm to provide not just complete CPU designs, but finished CPU parts. Awad has been steering the Neoverse server CPU IP blocks effort since its launch in October 2018 and the Compute Sub System (CSS) effort that followed it in August 2023 to provide more finished CPU designs. The first customer to ask for a finished part, three years ago, was Meta Platforms – a request that resulted in the creation of the Arm-supplied AGI CPU – and OpenAI is right behind it. Meta Platforms is the one hyperscaler that is not also a cloud, and given this, it has more degrees of freedom when it comes to CPUs, GPUs, XPUs, DPUs, and switch ASICs; as a model builder that is also not a cloud, OpenAI has similar freedom. (Which is why you see these two companies working on all kinds of things, all the time.)
Ampere Computing was getting some traction selling its AmpereOne chips into the hyperscalers and cloud builders, but once the company was acquired by SoftBank, it went kind of quiet. We saw the 192-core “Polaris” AmpereOne M chips, which started shipping in Q4 2024, appearing in instances on Oracle Cloud Infrastructure, and that was about it. We strongly suspect that Ampere Computing was acquired to give Arm a second chip design team so it can drive its homegrown CPU roadmap, which has an annual cadence. SoftBank has disclosed nothing about how it is going to mix and match Arm and Ampere Computing, as far as we know, and the top brass at Arm said nothing about it yesterday except that the AGI CPU effort started about three years ago at the behest of Meta Platforms.
Rene Haas, the chief executive of Arm shown above holding the first generation AGI CPU that is sampling now and that will ship in volume later this year to Meta Platforms and OpenAI and anyone else who wants to place an order, laid out the math for why CPUs still matter in agentic AI datacenters. The “why Arm?” question is a foregone conclusion, with all of the hyperscalers and big cloud builders having long since designed their own Arm server CPUs. With Nvidia shipping most of the GPUs in the world, and every one of those NVL72 rackscale systems being based on “Grace” CG100 Arm CPUs, Arm is the default CPU architecture for the host side of big AI nodes. The hyperscalers and cloud builders all want to pair their Arm CPUs to Nvidia GPUs, AMD GPUs, and homegrown XPUs, and in some cases they will use licensed NVLink Fusion ports and in others they will use UALink or ESUN over Ethernet.
What matters to Arm is that over 350 billion Arm chips have been shipped since Arm was spun out of Acorn all those years ago, and there are tens of billions more potential shipments in the near term because every GPU or XPU needs a hell of a lot of CPU cores, either in the host system or in the DPUs that virtualize networking or act as distributed storage controllers these days.
In fact, says Haas, a modern AI datacenter with 1 gigawatt of power – which as we have calculated before would have somewhere on the order of 500,000 to 600,000 accelerators – also has 30 million CPU cores, which works out to around 300,000 CPUs assuming an average of 100 cores per CPU (some have more, some have less). But with agentic AI, where agents backed by dozens of models are going to be querying inference models at more than 15X the rate of humans using chattybots, more and more CPUs are going to have to be added to the AI inference systems. Haas says a conservative estimate is at least 120 million cores per gigawatt. With core counts rising to an average of 120 per CPU, that represents around 1 million CPUs per gigawatt, and with somewhere between 100 gigawatts and 150 gigawatts of incremental AI datacenter capacity coming (according to various estimates), that works out to roughly 100 million to 150 million CPUs.
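The back-of-the-envelope arithmetic above is easy to check. A minimal sketch, using only the figures quoted in Haas’s presentation (the per-gigawatt core counts and per-CPU averages are his estimates, not our own data):

```python
# Today: ~30 million cores per gigawatt at an average of ~100 cores per CPU.
cores_per_gw_today = 30_000_000
cores_per_cpu_today = 100
cpus_per_gw_today = cores_per_gw_today // cores_per_cpu_today
print(cpus_per_gw_today)            # 300000 CPUs per gigawatt

# Agentic AI estimate: at least 120 million cores per gigawatt,
# with average core counts rising to ~120 per CPU.
cores_per_gw_agentic = 120_000_000
cores_per_cpu_future = 120
cpus_per_gw_agentic = cores_per_gw_agentic // cores_per_cpu_future
print(cpus_per_gw_agentic)          # 1000000 CPUs per gigawatt

# 100 GW to 150 GW of incremental AI datacenter capacity implies:
for gigawatts in (100, 150):
    print(gigawatts, gigawatts * cpus_per_gw_agentic)
    # 100 -> 100 million CPUs, 150 -> 150 million CPUs
```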
Small wonder, then, that Arm has decided to grab the brass ring and stop screwing around. There is a $100 billion total addressable market by 2030 for CPUs in agentic AI systems, and the company has told Wall Street (but not those of us who attended the Arm Everywhere event yesterday) that it can drive $15 billion in revenues by 2031 from its AGI CPU products.
This makes royalty sales of Neoverse IP blocks and CSS licenses, which have done well, look like small potatoes. The thing about hyperscalers and cloud builders is that they do not want to have to design CPUs and XPUs. They only do this because they think the current crop of suppliers are charging too much for their chippery. Because Arm’s chips are compatible with all of the homegrown CPUs, now they can depend on Arm to deliver the goods – and there is a non-zero chance that more than a few of the tech titans will lean more heavily on the AGI CPU than their own designs.
A lot depends on the design of the first AGI CPU and what the plans are for the follow-ons shown on the Arm roadmap.
Born To Run Off Batteries
There is no way in hell that Arm can compare its AGI CPU to the homegrown Arm server CPUs that its hyperscaler and cloud builder customers have brought into existence over the past decade – and which Arm itself has benefitted financially from.
So in the presentations, the X86 architecture has to be the enemy. But the truth is that Arm has to co-exist with these homegrown designs and, if it hopes to have a stable and growing business, has to beat these chips for particular workloads on the metrics gauging performance, scale, efficiency, and cost. Awad spoke in detail on the first three in rolling out the AGI CPU, but the fourth obviously completes the TCO and TCA equations.
To get the project with Meta Platforms going quickly, Arm chose its “Poseidon” V3 core and its “Voyager” CSS V3 platform as a starting point, which we detailed back in February 2024. The Poseidon core has a pair of SVE2 vector units per core and thus is no slouch when it comes to doing some of the computations that AI workloads require.
The AGI CPU-1 – which is what we are calling it to differentiate it from future models, since we do not know its codename – has up to 136 cores based on the Armv9.2 instruction set. It is etched in one of the N3 processes from Taiwan Semiconductor Manufacturing Co, which have transistors in the range of 3 nanometers and practical reticle limits of a little more than 800 mm², just like everyone else.
This 136 core count does not quite add up to the original specs for the Voyager CSS block:
The AGI CPU has two chiplets, each with half of the compute and I/O available to the socket, linked by a die-to-die interconnect. Awad said this was a better design than having separate I/O and memory controller dies linked to compute core dies, as both AMD and Intel have done with their CPUs, because it cuts down on NUMA domains in the cache memory hierarchy and allows for low latency paths between all cores and all memory. Any core can talk to any memory linked to any controller in the socket in less than 100 nanoseconds.
The Voyager CSS specs, as you can see, have a 64-core block with six DDR5 memory controllers and four PCI-Express 5.0 controllers on each chiplet. The design was made to scale to two chiplets per socket, for a total of 128 V3 cores, a dozen memory controllers, and eight PCI-Express 5.0 controllers.
That 136-core count is weird. If you look carefully at the die shot above – and we had to do a bit of searching before we found one that gave us the right resolution on the details – you can see that each chiplet of the AGI CPU-1 has a grid of five columns of twelve cores, for 60 cores per chiplet and 120 cores per socket. Above and below these columns of V3 cores, there are what look like a different sort of core – perhaps these are “Hermes” Neoverse N3 cores, which have been allocated to do more generic server work where vector math is not as important. (See the Neoverse roadmaps for more on this.) Yet the spec sheet is unequivocal about there being 136 Neoverse V3 cores. And even if this is true, if we add 20 special V3 cores to 120 regular V3 cores we still get 140 cores, not 136.
The six memory controllers per chip are along the top and bottom edges of each chip in the orientation of the die shot above, with PCI-Express controllers along the right and left outside edges. There could be six or more Neoverse V3 cores lurking along the edges where the memory controllers are, but this would be odd. Things, we realize, can be odd – and often are.
This die image above could also be inaccurate, which is not playing fair.
Getting a clean 136 cores is tough with columns stacked a dozen cores high. If you keep the twelve-high core blocks, then six columns per chiplet across two chiplets gives you 144 cores per socket, and if you assume a 94.4 percent yield on the cores on each chiplet, you get 136 usable cores. Which, again, is not what the image shows.
We don’t know the L1 caches on each core, but we do know that each core has 2 MB of L2 cache, and the cores run at a maximum of 3.7 GHz. There is no turbo boost mode for the cores and there is no simultaneous multithreading – two things that X86 CPUs have and that Ampere Computing absolutely did not believe in, either. Both are more trouble than they are worth when you are trying to get deterministic performance out of a CPU, in the Ampere Computing design philosophy – and SMT also presents another attack surface for security vulnerabilities.
Each DDR5 controller on the AGI CPU-1 can take one DIMM, and the memory can run at up to 8,800 MT/sec, yielding around 6 GB/sec of memory bandwidth per core. (A dozen DDR5 controllers using DDR5-8800 memory yield 844.8 GB/sec of aggregate memory bandwidth; at a 100 percent yield of 144 cores that is 5.9 GB/sec per core, and at a 94.4 percent yield of 136 cores that is 6.2 GB/sec per core.)
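The bandwidth arithmetic above can be sketched in a few lines, assuming the standard 64-bit (8-byte) data path per DDR5 channel:

```python
# Memory bandwidth arithmetic for the AGI CPU-1, per the specs above.
controllers = 12              # six DDR5 controllers per chiplet, two chiplets
transfers_per_sec = 8800      # DDR5-8800 runs at 8,800 megatransfers per second
bytes_per_transfer = 8        # 64-bit data path per DIMM channel

channel_bw = transfers_per_sec * bytes_per_transfer / 1000   # GB/sec per channel
aggregate_bw = controllers * channel_bw
print(round(aggregate_bw, 1))                    # 844.8 GB/sec aggregate

for cores in (144, 136):                         # full grid vs yielded part
    print(cores, round(aggregate_bw / cores, 1)) # 5.9 and 6.2 GB/sec per core
```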
There are 96 lanes of PCI-Express 6.0 I/O on each socket, which strongly suggests that Meta Platforms will be needing higher bandwidth I/O than Arm was considering for the Voyager CSS V3 stack. This I/O may link out over DPUs to Ethernet switches, with both running the ESUN memory coherency protocol that Meta Platforms is pushing. (Or rather, pulling.) Those PCI-Express 6.0 slots can also support main memory expansion should it be necessary, of course sacrificing some of the I/O running to the outside world.
Here is the truly neat thing about this 136-core AGI CPU-1 chip: It has a thermal design point of 300 watts. This is very gratifying to see, and it reflects the engineering choices Arm made to give Meta Platforms a powerful but energy-sipping motor. That works out to 2.2 watts per core.
By comparison, a 128-core “Granite Rapids” Xeon 6 6980P (using the full performance P-cores with SMT) running at a mere 2 GHz burns 500 watts, or 3.9 watts per core. And a 144-core “Sierra Forest” Xeon 6 (using the non-SMT energy efficient E-cores) running at 2.2 GHz burns 330 watts, or 2.3 watts per core. We don’t know what the performance of the V3 core running at 3.7 GHz is, but we strongly suspect it is a lot better than a Sierra Forest E-core and should best a Granite Rapids P-core given the 1.85X higher clock speed.
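The watts-per-core comparison above is just TDP divided by core count, which a quick sketch confirms:

```python
# Watts per core (TDP divided by core count) for the chips compared above.
chips = {
    "AGI CPU-1 (136 cores)":            (300, 136),
    "Granite Rapids Xeon 6 6980P":      (500, 128),
    "Sierra Forest Xeon 6 (144 cores)": (330, 144),
}
for name, (tdp_watts, cores) in chips.items():
    print(f"{name}: {tdp_watts / cores:.1f} watts per core")
    # 2.2, 3.9, and 2.3 watts per core, respectively
```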
We did get some hints about performance of the AGI CPU-1 from Awad. Take a look:
The performance shown is all normalized to an unspecified X86 core, which we presume is a Granite Rapids P-core given that it supports SMT and the Sierra Forest E-core does not support SMT at all. The performance is all relative to a thread or a rack of threads, which is shown in the dark gray columns. The light gray columns show the effect on the metrics when SMT is turned on.
The first part of this chart shows performance per thread, which seems to show an AGI CPU thread – meaning a core, since it does not have threading – is about 1.3X higher than whatever the X86 core is. SMT actually causes performance per thread to go down on the X86 server chip, and it doesn’t help the number of sustained threads per rack by all that much. The performance per watt is where the AGI CPU is comparing well to the X86 alternative.
FYI: We have no idea what the workload is here, but given the nature of the AGI CPU and the job it is supposed to be doing in the datacenter, presumably it is an AI-centric workload being used for comparison and not a SPEC test.
What neither Haas nor Awad talked about was price, and that is a big part of what will make the AGI CPU successful or not. Arm will have to price to value to drive that $15 billion in sales by 2031, meaning the AGI CPU has to cost less than an equivalent X86 processor per unit of performance, while clawing some money back because performance per dollar per watt is the real metric. And innovation and differentiation have to happen with each subsequent AGI CPU that comes out, too.
As everyone knows, companies buy roadmaps, not point products. So here is the roadmap for the AGI CPU family:
The move to High NA EUV with 2 nanometer processes and gate all around transistors will allow for single patterning of chip masks and for transistors that are 1.7X smaller in each dimension, which works out to close to triple the transistor density. So with twice as many reticle-limited High NA chiplets in a socket, Arm could in theory get around 6X the transistors to store data and do work. But that will take a four-chiplet design with die-to-die interconnects on two edges out of four for each chiplet.
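The scaling arithmetic behind that roadmap claim can be sketched as follows; the 1.7X linear shrink and the doubling of chiplets per socket are the figures from the roadmap discussion above:

```python
# Transistor scaling arithmetic for the 2 nanometer roadmap claim above.
linear_shrink = 1.7                  # transistors 1.7X smaller in each dimension
density_gain = linear_shrink ** 2    # density scales as the square of the shrink
print(round(density_gain, 2))        # 2.89, i.e. close to triple the transistors

chiplet_multiple = 2                 # twice as many chiplets per socket
print(round(chiplet_multiple * density_gain, 2))  # 5.78, roughly the 6X quoted
```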
The chart above implies an annual cadence, but doesn’t guarantee it. It also shows very clearly that Arm will be happy to sell Neoverse IP blocks as well as future Compute Sub Systems setups to customers who want to do their own thing – as we expect everyone who is doing this now to continue doing. The AGI CPU is about getting Arm server CPUs into the hands of large enterprises and companies that do not have their own chip design teams, as well as giving those who design their own a second option – just as AMD has presented customers with a healthy alternative to Intel in the X86 ecosystem.
One last thought: Maybe Arm can make an Acorn workstation and fully complete the circle? The event was called Arm Everywhere, not Arm Datacenter, after all. And Haas did hint that edge and PC devices were coming, giving Arm a much broader $1 trillion TAM to chase. Why not?
In for a penny, in for a pound.