Lining Up The “El Capitan” Supercomputer Against The AI Upstarts

The question is no longer whether or not the “El Capitan” supercomputer that has been in the process of being installed at Lawrence Livermore National Laboratory for the past week – with photographic evidence to prove it – will be the most powerful system in the world. The question is how long it will hold onto that title.

It could be for quite a long time, as it turns out. Because when it comes to the big AI supercomputers that AI startups are funding, to use an old adage that described IBM systems in the 1990s: “You can find better, but you can’t pay more.”

It doesn’t look like any of the major HPC centers at the national labs around the globe are going to field a persistent machine – meaning not an ephemeral cloudy instance fired up just long enough to run the High Performance Linpack test in double precision floating point that is used to gauge the relative performance of machines and rank them on the Top500 list – that can beat El Capitan. Depending on our mood and our math, we think El Capitan could weigh in at around 2.3 exaflops peak FP64 performance, about 37 percent more FP64 oomph than the 1.68 exaflops “Frontier” supercomputer at Oak Ridge National Laboratory that has been the most powerful machine on the Top500 list since June 2022.

Way back in 2018, after the CORAL-2 contracts were awarded, we expected Frontier to come in at 1.3 exaflops FP64 peak for $500 million using custom AMD CPUs and GPUs and El Capitan to come in at 1.3 exaflops peak for $500 million using off-the-shelf AMD CPUs and GPUs. That was also when the revised “Aurora A21” machine was slated to come in at around 1 exaflops for an estimated $400 million. All three of these machines are being installed later than anyone hoped when the HPC labs started planning for exascale in earnest back in 2015. And in the case of Frontier and El Capitan, we think AMD offered much more bang for the buck and outbid IBM and Nvidia for the contracts, which would have naturally gone to those two vendors given that they had built the prior generation “Summit” and “Sierra” systems at Oak Ridge and Lawrence Livermore. But that is just conjecture, of course.

Here’s the point for 2023 and beyond: Don’t count the hyperscalers and cloud builders and their AI startup customers out of the hunt. They are building very big machines, and perhaps ones that, like the one Nvidia and CoreWeave are working on for Inflection AI and the ones that Microsoft Azure is building for OpenAI, will surpass these massive HPC machines when it comes to lower precision AI training work.

Let’s do some math to compare as we show off the El Capitan baby pictures that Lawrence Livermore has shared.

The tractor trailer delivering some El Capitan racks to Lawrence Livermore National Laboratory

For our comparisons, let’s start with that as-yet-unnamed system being built for Inflection AI, which we talked about last week when the pictures of the El Capitan machine surfaced.

That Inflection AI machine looks like it is using 22,000 Nvidia H100 SXM5 GPU accelerators, and based on what little we know about H100 and InfiniBand Quantum 2 networking pricing, it would list for somewhere around $1.35 billion if the nodes are configured something like a DGX H100 node with 2 TB of memory, 3.45 TB of flash, and eight 400 Gb/sec ConnectX-7 network interfaces and a suitable three-tier InfiniBand switch fabric. That system would be rated at 748 petaflops of peak FP64 performance, which is interesting for the HPC crowd, and would be ranked second on the current Top500 list, behind Frontier at 1.68 exaflops FP64 peak and ahead of the “Fugaku” system at RIKEN Lab in Japan, which has 537.2 petaflops FP64 peak.
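
If you want to check our arithmetic on that FP64 figure, here is the back-of-the-envelope math in a few lines of Python, using Nvidia’s published rating of roughly 34 teraflops of FP64 vector performance per H100 SXM5:

```python
# Back-of-the-envelope FP64 peak for the Inflection AI cluster.
# Uses Nvidia's published ~34 teraflops FP64 vector rating per H100 SXM5.
gpus = 22_000
fp64_tflops_per_gpu = 34

fp64_peak_pflops = gpus * fp64_tflops_per_gpu / 1_000
print(f"FP64 peak: {fp64_peak_pflops:,.0f} petaflops")   # ~748 petaflops
```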

Discount this Inflection AI machine how you will, but we don’t think Nvidia or AMD are in any mood to give deep discounts on GPU compute engines when demand is far exceeding supply. And neither are their server OEM and ODM partners. And so, these machines are very pricey indeed compared to the exascale HPC systems in the United States, and they are much less capable, too.

The tractor trailer delivering some El Capitan racks to Lawrence Livermore National Laboratory

If you look at the FP16 half precision performance of the Inflection AI machine, it comes in at 21.8 exaflops, which sounds like a lot and which is plenty enough to drive some very large LLMs and DLRMs – that’s large language models and deep learning recommendation models.
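
That figure is the same arithmetic run at half precision, assuming the roughly 990 teraflops of dense (sparsity off) FP16 Tensor Core math that Nvidia quotes for each H100 SXM5:

```python
# FP16 matrix math peak for the Inflection AI cluster, sparsity off.
# Assumes ~990 teraflops of dense FP16 Tensor Core math per H100 SXM5.
gpus = 22_000
fp16_dense_tflops_per_gpu = 990

fp16_peak_eflops = gpus * fp16_dense_tflops_per_gpu / 1_000_000
print(f"FP16 peak: {fp16_peak_eflops:.1f} exaflops")   # ~21.8 exaflops
```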

No one knows what the FP16 matrix math performance of the “Antares” AMD Instinct MI300A CPU-GPU hybrid that powers El Capitan will be, but we took a stab at guessing it back in June when a few more tidbits of data were revealed about this compute engine. We think that Lawrence Livermore is not only getting two CPU tiles on a package (replacing two GPU tiles) alongside six GPU tiles, but is also getting an overclocked compute engine that will deliver more performance than an eight tile, GPU-only MI300 compute engine. (And if Lawrence Livermore didn’t get something like this, it should have.) If we are right, then without sparsity math support turned on (which Inflection AI did not use when it talked about the performance of the machine it is building with CoreWeave and Nvidia), each MI300A is estimated to deliver 784 teraflops of FP16 at a 2.32 GHz clock frequency (compared to what we expect to be around a 1.7 GHz clock frequency for the regular MI300 part).
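
To show how that overclocking guess hangs together, here is the scaling arithmetic in Python. Every number in it – the 784 teraflops, the 2.32 GHz and 1.7 GHz clocks, and the tile-count scaling – is our estimate, not an AMD specification:

```python
# Sketch of our MI300A guess -- all numbers here are our estimates, not AMD specs.
mi300a_fp16_tflops = 784     # guessed dense FP16 per MI300A (six GPU tiles)
mi300a_clock_ghz   = 2.32    # guessed overclocked frequency for El Capitan's part
mi300_clock_ghz    = 1.70    # guessed clock for the regular eight-tile MI300

# Scale by tile count and clock to see what an eight-tile, GPU-only MI300 implies.
mi300_fp16_tflops = mi300a_fp16_tflops * (8 / 6) * (mi300_clock_ghz / mi300a_clock_ghz)
print(f"Implied eight-tile MI300 at 1.7 GHz: {mi300_fp16_tflops:.0f} teraflops dense FP16")
# ~766 teraflops, so the hot-clocked six-tile MI300A edges past the eight-tile part
```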

We are hopeful that Hewlett Packard Enterprise can get eight MI300As per sled in the El Capitan system, and if that happens, the compute part of El Capitan should weigh in at around 2,931 nodes, 46 cabinets, and eight rows. We shall see.

What we wanted to make clear is that if our guesses on the MI300A are correct – we know how big that if is, people – then El Capitan should have around 23,500 MI300 GPUs and – wait for it – around 18.4 exaflops of FP16 matrix math peak performance. That is about 84 percent of the AI training oomph of the AI system being built with all that venture capital money by Inflection AI, and for a lot less money and with a lot more FP64 oomph.
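
Here is how the whole-machine guess adds up, using that 784 teraflops FP16 estimate per device and the 98.1 teraflops FP64 estimate per MI300A from our June analysis – again, all estimates on our part:

```python
# El Capitan totals under our MI300A guesses -- estimates, not official figures.
nodes           = 2_931        # estimated sleds, each with eight MI300As
apus            = nodes * 8    # ~23,448, call it around 23,500 MI300As
fp16_per_apu_tf = 784          # our guessed dense FP16 per MI300A
fp64_per_apu_tf = 98.1         # our June estimate of FP64 per MI300A

fp16_total_ef = apus * fp16_per_apu_tf / 1_000_000
fp64_total_ef = apus * fp64_per_apu_tf / 1_000_000
print(f"{apus:,} MI300As -> {fp16_total_ef:.1f} exaflops FP16, {fp64_total_ef:.1f} exaflops FP64")
# ~18.4 exaflops FP16 and ~2.3 exaflops FP64 peak
```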

El Capitan is in a raised floor datacenter environment, and you have to reinforce the floor to wheel those “Shasta” Cray EX racks from Hewlett Packard Enterprise into place.

Now, let’s take a stab at the rumored 25,000 GPU cluster that Microsoft is building for OpenAI to train GPT-5. Historically, as Nidhi Chappell, general manager of Azure HPC and AI at Microsoft, explained to us back in March, Azure uses PCI-Express versions of Nvidia accelerators to build its HPC and AI clusters, and it uses InfiniBand networking to link them together. We assume this rumored cluster uses Nvidia H100 PCI-Express cards, and at $20,000 a pop, that is $500 million right there. With a pair of Intel “Sapphire Rapids” Xeon SP host processors, 2 TB of main memory, and a reasonable amount of local storage, add another $150,000 per node and that works out to another $469 million for the 3,125 nodes needed to house these 25,000 GPUs. InfiniBand networking would add, if Nvidia’s 20 percent rule is a gauge, another $242 million. That’s $1.21 billion. Discount the server nodes if you feel like it, but that is roughly $387,500 per node and it ain’t gonna budge that much. Not with so much demand for AI systems.
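
Spelled out in Python, with our list price assumptions and Nvidia’s rule of thumb that networking is about 20 percent of the total system cost, the stack looks like this:

```python
# Rough cost stack for the rumored 25,000 GPU Microsoft/OpenAI cluster.
# All of these are our list price assumptions -- no discounts applied.
gpus            = 25_000
gpu_price       = 20_000      # assumed price per H100 PCI-Express card
gpus_per_node   = 8
node_extra_cost = 150_000     # host CPUs, memory, and local storage per node

nodes        = gpus // gpus_per_node                       # 3,125 nodes
compute_cost = gpus * gpu_price + nodes * node_extra_cost  # ~$969 million
total_cost   = compute_cost / 0.80     # InfiniBand fabric as ~20 percent of total
network_cost = total_cost - compute_cost                   # ~$242 million

print(f"Compute: ${compute_cost / 1e6:,.0f}M  Network: ${network_cost / 1e6:,.0f}M")
print(f"Total: ${total_cost / 1e9:.2f}B  (${total_cost / nodes:,.0f} per node)")
```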

As we say in New York City: Fuhgeddaboudit.

If you do the math on this Microsoft/OpenAI cluster, it weighs in at 19.2 exaflops FP16 matrix math peak with sparsity off. The PCI-Express versions of the H100 have fewer streaming multiprocessors – 114 versus 132 on the SXM5 version – and they clock slower, too. That works out to about 10 percent cheaper for 11.9 percent less performance.

These prices are crazy compared to what the US national labs are getting – or at least have been able to get over the years. The reason why the HPC centers of the world chase novel architectures is that they can pitch themselves as research and development for a product that will eventually be commercialized. But the hyperscalers and cloud builders can do this same math, and they can also build their own compute engines, as Amazon Web Services, Google, Baidu, and Facebook are all doing to varying degrees. Even with a 50 percent discount, those Inflection AI and OpenAI machines are still a lot more expensive per unit of compute than what the US national labs are paying.

One El Capitan row down, perhaps seven more to go.

El Capitan will take up the same footprint that the retired “ASCI Purple” and “Sequoia” supercomputers from days gone by, built by IBM for Lawrence Livermore, used successively – about 6,800 square feet. El Capitan is expected to need somewhere between 30 megawatts and 35 megawatts of power and cooling at peak, and will run side-by-side with the next exascale-class machine that Lawrence Livermore expects to install around 2029, and so the datacenter power and cooling capacity at the lab has been doubled to accommodate these two machines running concurrently.

By comparison, that ASCI Purple machine built by IBM and installed in 2005 at Lawrence Livermore was rated at 100 teraflops peak performance at FP64 precision and burned about 5 megawatts; it cost an estimated $128 million. El Capitan could have 23,000X more performance at somewhere between 6X and 7X the power draw and at 3.9X the cost. That may not be as good as the exponential growth that supercomputing centers had expected for many decades, but it is still a remarkable feat and attests to the benefit of Moore’s Law and a whole lot of packaging, networking, and power and cooling cleverness.
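
For those keeping score, here is the generational math, taking the midpoint of that 30 megawatt to 35 megawatt power estimate:

```python
# ASCI Purple (2005) versus El Capitan (2023), using the figures cited above.
purple_pflops, purple_mw, purple_cost = 0.1,   5.0,  128e6   # 100 TF, 5 MW, $128M
elcap_pflops,  elcap_mw,  elcap_cost  = 2_300, 32.5, 500e6   # 2.3 EF, ~32.5 MW, $500M

print(f"Performance: {elcap_pflops / purple_pflops:,.0f}X")  # ~23,000X
print(f"Power:       {elcap_mw / purple_mw:.1f}X")           # ~6.5X
print(f"Cost:        {elcap_cost / purple_cost:.1f}X")       # ~3.9X
```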

We can’t wait to see the real numbers for El Capitan and for Aurora A21 at Argonne National Laboratory. And if, as we suspect, Intel wrote off $300 million of the $500 million contract with Argonne, then there is not going to be a cheaper AI and HPC machine in the world. Yes, Argonne paid in time and will pay in electricity to use this machine, but as we pointed out two weeks ago when the Aurora machine was being fully installed, what matters now is getting the machine built and doing actual HPC and AI work.


15 Comments

  1. Per your article on the MI300A from June, 1,567 teraflops is for FP8 and not FP16 — for dense matrix.

    I think you are off by 2x in this article.

    • For the FP16 without sparsity, yes. I can’t read my own chart! Fixing it now. My eyes are not as young as they used to be….

  2. Right on! Efficiency is such a huuuge outcome of this evolving tech success story — to wit, the table in your “Charm Of AMD” piece notes that one MI300A does 98.1 FP64 teraflops, which is just about the same as 2005’s whole ASCI Purple supercomputer (100 teraflops), all for 1 kW (rather than 5 MW)!

    • Hmmmm … bring on the petaflopping cephalopods (nodules of 16 MI300A or GH100 units, with uber-fast networking tentacles), scatter them about in a thousand remote locations, and heed the composable rise of solar-powered ExaCthulhu!

  3. A 35% clock speed bump for the GPU chiplets sounds extreme. I’d think more like 20% could be possible, near 2 GHz.

  4. Wait until AI develops new forms of logic that are optimized for AI computation. Computers today are focused on binary; what if AI develops a 3D/4D programmable logic type that combines a multitude of inputs and outputs, obsoleting everything?

  5. Since Fugaku is 14 percent faster than Frontier when doing conjugate gradients, it seems likely that El Capitan will also top the possibly more relevant HPCG when the benchmarks are ready.

  6. Are those machines liquid cooled? Are the blue and red lines at the back of the units tubes for the liquid cooling, or just cables? Also, in the last picture, what is the white gas-can-like container at the bottom of the last rack?

    • The Cray racks are water cooled as far as I know. The can is a helium tank, which the sysadmins use to make their voices sound like chipmunks. This is exactly what you need to do when you are dealing with nuclear weapons.

    • The white tank is a make-up tank in the Cooling Distribution Unit (CDU). Replacing blades may remove some water from the system. If the CDU needs to add some coolant, it pulls it from this tank.
