Stacking Up Google’s “Ironwood” TPU Pod To Other AI Supercomputers

As part of the pre-briefings ahead of the Google Cloud Next 2025 conference last week and then during the keynote address, the top brass at Google kept comparing a pod of “Ironwood” TPU v7p systems to the “El Capitan” supercomputer at Lawrence Livermore National Laboratory. And they kept doing it wrong, and it has annoyed us.

It is perfectly legitimate to make such a comparison for large-scale AI systems, even if in one case (El Capitan) the machine’s primary purpose is to run traditional HPC simulation and modeling workloads and in the other (the Ironwood pod) the machine can’t do high-precision floating point math at all and is really only designed to do AI training and inference. In a sense, a machine using a hybrid architecture of CPUs and GPUs for compute is more of a general purpose machine, given its wide range of numerical types and precisions and the wide variety of workloads it can run, and there is some value in such an architecture in that it is multipurpose.

But, as it turns out, exascale-class machines like El Capitan at Lawrence Livermore and “Aurora” at Argonne National Laboratory can hold their own against machines built with custom XPU accelerators. And because of the sweet deals that the US government’s Department of Energy secures with supercomputer makers, they offer bang for the buck that we think is better than what Google gets using its own machinery, and far better than what customers get when they rent access to the TPUs for their AI workloads.

Here is one of the offending charts we saw:

In this math, Google is comparing the sustained performance of El Capitan with 44,544 AMD “Antares-A” Instinct MI300A hybrid CPU-GPU compute engines running the High Performance LINPACK (HPL) benchmark at 64-bit floating point precision against the theoretical peak performance of an Ironwood pod with 9,216 of the TPU v7p compute engines.

This is a perfectly silly comparison, and Google’s top brass not only should know better, but does. Perhaps more importantly, performance is only half the story. You have to reckon the cost of the compute as well. High performance has to come at the lowest possible cost, and no one gets better deals on HPC gear than the US government’s Department of Energy.

In the absence of a lot of data, we did a price/performance analysis of modern AI/HPC systems, many of which are based on the combination of CPUs and GPUs, with the latter being from AMD or Nvidia and the former not being all that important in terms of raw computation. Take a look:


This comparison is not perfect, we realize. The Google and Amazon Web Services pricing includes the cost of renting the systems for three years, which of course includes the cost of power, cooling, facilities, and management. For many of the supercomputers shown, the budget covers facilities, power, and cooling over three to four years, and we have done our best not to include any non-recurring engineering (NRE) costs to get machines in the field running and tuned up. For the various AI machines, we gave estimates of machine sizes and costs where information was not available.

All estimates are shown in bold red italics, and we have question marks where we are not able to make an estimate at this time.

We only showed the TPU systems that had 3D torus interconnects linking them together into fairly large pods, and so we left out the prior generation “Trillium” TPU v6e systems, which only scale to 256 compute engines in a 2D torus topology.

As you might expect, over the past four years, the cost of both FP64 high precision and FP16 and FP8 low precision processing has come down even as the performance of machines has gone up. Which is a good thing. But the cost of machines is going up fast, to the point that what we would call a capability-class AI supercomputer now costs billions of dollars. (Think the xAI “Colossus” machine shown above, which was installed last year.)

In the table above, we calculated the cost of renting Google TPU pods under committed use discounts, or CUDs, which are akin to reserved instance pricing at Amazon Web Services and which give discounts for long-term commitments. A traditional HPC supercomputer is typically in the field for three years, sometimes four, so this is a good comparison point. The estimated Ironwood TPU pod pricing assumes Google is somewhat aggressive, as it was in the jump from the TPU v4 pods to the TPU v5p pods.

Now, to clear up the confusion. An Ironwood TPU v7p pod is rated at 21.26 exaflops at FP16 resolution and double that, 42.52 exaflops, at FP8 resolution. This pod has 1.69 PB of HBM memory, and we estimate it costs around $445 million to build and north of $1.1 billion to rent over the course of three years. If you do the math on that, it works out to Google being able to use an Ironwood pod with 9,216 interlinked TPUs for around $21 per teraflops, while renting it costs around $52 per teraflops.

The El Capitan machine, which was built by Hewlett Packard Enterprise, cost Lawrence Livermore $600 million, and that works out to $14 per teraflops at FP16 resolution at peak performance. Because Intel took a $300 million writeoff on the “Aurora” machine at Argonne, that DOE lab only paid $200 million for that AI/HPC system, which means that with its 16.1 exaflops at FP16 precision, that FP16 oomph costs only $12 per teraflops. The Aurora machine’s “Ponte Vecchio” GPUs, unlike both El Capitan’s MI300A ceepie-geepies and the Ironwood pod’s TPU v7p engines, do not support FP8 processing, but they do support INT8 processing, just like the two prior generations of Google TPUs used in 3D torus setups.
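To make that arithmetic explicit, here is a minimal Python sketch of the dollars per teraflops math using the figures cited above. The dollar amounts and the Ironwood and Aurora FP16 ratings come from the text; El Capitan’s FP16 peak of roughly 43.6 exaflops is our own estimate, not an official rating.

```python
# Back-of-the-envelope dollars per teraflops from the figures above. All dollar
# amounts are estimates from this article; El Capitan's FP16 peak (~43.6 EF) is
# our own estimate, not an official rating.

def dollars_per_teraflops(cost_dollars: float, fp16_exaflops: float) -> float:
    """System cost divided by FP16 throughput, expressed in teraflops."""
    return cost_dollars / (fp16_exaflops * 1_000_000)  # 1 EF = 1,000,000 TF

systems = {
    "Ironwood pod (build, est.)":       (445e6, 21.26),
    "Ironwood pod (3-year rent, est.)": (1.1e9, 21.26),
    "El Capitan (FP16 peak est.)":      (600e6, 43.6),
    "Aurora (after Intel writeoff)":    (200e6, 16.1),
}

for name, (cost, exaflops) in systems.items():
    print(f"{name}: ${dollars_per_teraflops(cost, exaflops):.0f} per teraflops")

# Prints roughly $21, $52, $14, and $12 per teraflops, matching the text above.
```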

The FP8 and INT8 formats have the effect of doubling the price/performance of any machine that supports them and whose workloads can take advantage of them; FP4, which is available on “Blackwell” GPUs from Nvidia and will be added to future XPU AI compute engines, doubles it again.
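As a quick, hedged illustration of that halving, starting from our estimated $52 per teraflops Ironwood rental figure at FP16:

```python
# Effective cost per unit of work halves with each step down in precision,
# assuming the workload can tolerate the lower precision.
fp16_rent = 52.0           # $/teraflops at FP16, our Ironwood rental estimate
fp8_rent = fp16_rent / 2   # ~$26/teraflops if the work runs entirely in FP8
fp4_rent = fp8_rent / 2    # ~$13/teraflops on hardware with FP4, such as Blackwell
print(fp16_rent, fp8_rent, fp4_rent)
```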

We normalize to FP64 performance for HPC and FP16 for AI just to keep it simple, but we added a column for FP8 or INT8 processing. Companies will stick with a floating point format across both training and inference where they can, and eventually the INT16, INT8, and INT4 formats will be deprecated.

The AWS P5 UltraCluster is the poster child for clusters built using Nvidia “Hopper” H100 GPUs in late 2022, all through 2023, and in early 2024. We calculated the cost to rent a cluster with 20,000 GPUs and then backed into an estimated cost of acquisition based on the prevailing H100 and other system costs at the time. Microsoft Azure and Google Cloud would have paid about the same amount to build similar machines, and would have charged about the same to rent the capacity on them to end users. AWS and Microsoft locked their prices for GPU instances, in fact, which may or may not be legal.

The Ironwood pod, if our estimates are correct, will cost about a third as much for Google to build and for customers to rent as these H100 clusters of similar performance, and it will use less than half as many compute engines. (As gauged by socket count, at least.)

But, finally, let’s be clear. El Capitan has 2.05X the performance of an Ironwood pod at FP16 and FP8 resolution at peak theoretical performance. The Ironwood pod does not have 24X the performance of El Capitan. It is true that El Capitan has 2.73 exaflops of peak performance at FP64 precision, where Ironwood has none, and that El Capitan had a rating of 1.74 exaflops on HPL in FP64 mode.
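For those who want to see exactly where the numbers land, here is the arithmetic behind both ratios, using the ratings cited in this article. The 24X figure only appears if you divide Ironwood’s peak FP8 rating by El Capitan’s sustained FP64 HPL result, which is presumably what Google did; El Capitan’s FP16 peak of roughly 43.6 exaflops is again our estimate.

```python
# Apples to oranges: peak FP8 on Ironwood versus sustained FP64 HPL on El Capitan
ironwood_fp8_peak_ef = 42.52    # Ironwood pod, peak FP8 exaflops
el_capitan_hpl_fp64_ef = 1.74   # El Capitan, sustained FP64 exaflops on HPL
print(ironwood_fp8_peak_ef / el_capitan_hpl_fp64_ef)    # ~24.4, the "24X" claim

# Apples to apples: peak FP16 on both machines
ironwood_fp16_peak_ef = 21.26   # Ironwood pod, peak FP16 exaflops
el_capitan_fp16_peak_ef = 43.6  # El Capitan, peak FP16 exaflops (our estimate)
print(el_capitan_fp16_peak_ef / ironwood_fp16_peak_ef)  # ~2.05X in El Capitan's favor
```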

We do not have an HPL-MxP result for El Capitan yet, but we expect one at the ISC 2025 conference in Hamburg in June. HPL-MxP uses a bunch of mixed precision calculations to converge to the same result as all-FP64 math on the HPL test, and these days delivers around an order of magnitude effective performance boost. This use of mixed precision leads the way in how real HPC applications can be tailored and boosted for lower precision math and therefore either get more work done on the same hardware or use less hardware to get the same work done.
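For readers who have not bumped into HPL-MxP before, the trick underneath it is mixed-precision iterative refinement: do the expensive factorization in low precision, then cheaply refine the residual in FP64 until the answer is as accurate as a pure FP64 solve. Here is a minimal NumPy sketch of that idea on a small dense system; it illustrates the principle rather than the benchmark code, and a real implementation would reuse the low-precision LU factors instead of re-solving from scratch on each pass.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

A32 = A.astype(np.float32)                        # low-precision copy for the solver

# Initial solve done entirely in FP32, then promoted to FP64
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

for _ in range(10):
    r = b - A @ x                                 # residual computed in FP64
    if np.linalg.norm(r) / np.linalg.norm(b) < 1e-12:
        break
    # Correction comes from another cheap FP32 solve against the FP64 residual
    x += np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```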


3 Comments

  1. Always appreciate these comparison tables. On power–how should we interpret this statement from the TPU v7 release? “Ironwood is built to support this next phase of generative AI and its tremendous computational and communication requirements. It scales up to 9,216 liquid cooled chips linked with breakthrough Inter-Chip Interconnect (ICI) networking spanning nearly 10 MW.”

  2. Great analysis covering the different tiers of the market and accounting for the short-term and long-term costs and goals that the mega hyperscalers, national labs, and OEM vendors play with. Appreciate your points on addressing the scale of the cloud providers and what they can bring to bear on the market with their sheer scale.

    How does one even determine a price to offer AWS in exchange for CPUs and GPUs along with all the other OEM hardware, knowing full well AWS has their own ARM chips they can sell to your customers?
