Every couple of years, Lawrence Livermore National Laboratory gets to install the world’s fastest supercomputer. And thankfully the HPC center usually chooses a machine that not only fulfills its mission of managing the nuclear weapons stockpile of the United States military, but also picks a mix of technologies that advances the state of the art in supercomputing.
This is what history has taught us to expect from Lawrence Livermore, and with the “El Capitan” system unveiled today at the SC24 supercomputer conference, history is indeed repeating itself. But this time is a little different because El Capitan is booting up amidst the largest buildout of supercomputing capacity in the history of Earth.
As far as we and the experts at Lawrence Livermore can tell, on many metrics El Capitan can stand toe to toe with the massive machinery that the hyperscalers and cloud builders are firing up for their AI training runs. El Capitan is a machine that is tailor-made to run some of the most complex and dense simulation and modeling workloads ever created, and it just so happens to be pretty good at the large language models that are at the heart of the GenAI revolution.
And thanks to the “Rosetta” Slingshot 11 interconnect, which was designed by Cray and is a core component of the EX line of systems sold by Hewlett Packard Enterprise, El Capitan already employs the kind of HPC-enhanced, scalable Ethernet that the Ultra Ethernet Consortium is trying to advance as the hyperscalers and cloud builders tire of paying a premium for InfiniBand networks for their AI clusters.
But perhaps more importantly – and people do not consider this enough – Lawrence Livermore is getting an extremely powerful HPC/AI supercomputer for a hell of a lot less money than the hyperscalers, cloud builders, and large AI startups are paying these days. It is hard to say with precision what the difference is, but our initial back-of-the-envelope calculations suggest that El Capitan costs half as much per unit of FP16 performance as the big “Hopper” H100 clusters being built by Microsoft Azure, Meta Platforms, xAI, and others.
There are benefits to being integral for national security, for pushing the architectural limits in system design as El Capitan does, and for having an AMD that is eager to prove its mettle in designing a hybrid CPU-GPU compute engine with wicked fast HBM memory feeding into a shared memory space across those converged devices.
And finally, there is another big difference between El Capitan and the beastly machines being built by the hyperscalers, cloud builders, and AI startups. El Capitan will manage the nuclear weapons that can in turn cause an extinction level event on our planet (or avoid one through the mutually assured destruction doctrine depending on your view), while the AI clusters are themselves the machines that may cause an extinction level event on Earth. (Let’s hope El Capitan’s AI is in a strong sandbox.)
The nuclear weapons in the US stockpile need to be simulated to ensure that they function properly – the Nuclear Test Ban Treaty prevents us from blowing one up to know for sure. The weapons in the stockpile also need to be redesigned and their explosives reused, and the test ban means this, too, has to be done through simulations alone. Hence the big budgets from the DOE for supercomputing in the United States.
The Feeds And Speeds
The El Capitan contract was awarded to Hewlett Packard Enterprise back in August 2019, and all we knew back then was that the machine would make use of the Slingshot interconnect, cost around $500 million, and deliver at least 1.5 exaflops of sustained performance. Only a few months earlier, HPE said it was going to buy Cray for $1.3 billion, and we think the relatively small size of Cray compared to the large bill of materials for the three exascale systems being built by the US Department of Energy is one of the reasons why HPE was probably “encouraged” to buy Cray in the first place.
Anyway, back then, El Capitan was expected to have at least 10X the sustained performance of the “Sierra” hybrid CPU-GPU system built by IBM for the lab, and to fit into a 30 megawatt power envelope. In March 2020, Lawrence Livermore said that it was tapping AMD for El Capitan’s compute engines, and said further that the system would have in excess of 2 exaflops of peak theoretical FP64 performance – the 64-bit resolution at which real simulations compute – and would burn around 40 megawatts and cost no more than $600 million. (“Upgrade!”)
The El Capitan hybrid CPU-GPU system is installed and running at near full capacity at Lawrence Livermore, and the consensus is that this is the highest performing system for traditional simulation and modeling workloads in the world. That reckoning takes into account the rumored peak performance of the “Tianhe-3” (2.05 exaflops) and “OceanLight” (1.5 exaflops) supercomputers in China.
In June 2022, Lawrence Livermore and AMD announced that they would be employing a converged CPU-GPU device – which AMD has called an accelerated processing unit, or APU, for more than a decade – as the main compute engine in the El Capitan system. And since that time, everyone has been trying to guess how many GPU compute units would be in the “Antares” Instinct MI300A devices and what clock speed they would run at. As it turns out, we thought the MI300A clocks would be higher and that it would therefore take fewer of them to reach the contracted performance. We also think that Lawrence Livermore is getting an even faster machine than it anticipated, and hence even better price/performance than expected.
Bronis de Supinski, the chief technology officer at Livermore Computing at Lawrence Livermore National Laboratory, tells The Next Platform that there are 87 compute racks in the El Capitan system, and these racks include the “Rabbit” NVM-Express fast storage arrays, which we detailed way back in March 2021, as well as the compute nodes.
El Capitan has a total of 11,136 nodes in liquid-cooled Cray EX racks, with four MI300A compute engines per node and a total of 44,544 devices across the system. Each device has 128 GB of HBM3 main memory that is shared across the CPU and GPU chiplets, which runs at 5.2 GT/sec and delivers an aggregate 5.3 TB/sec of bandwidth into and out of the CPU and GPU chiplets.
The MI300A CPU chiplets run at 1.8 GHz according to the data in the Top500 rankings for November, and the AMD spec sheet says that the GPU chiplets run at a peak 2.1 GHz. There are three “Genoa” X86 compute complexes, each with eight cores, for a total of 24 cores, etched in 5 nanometer processes from Taiwan Semiconductor Manufacturing Co. There are 228 GPU compute units on the six Antares GPU chiplets on the MI300A device, which have a total of 912 matrix cores and 14,592 streaming processors. On the vector units, the MI300A has a peak FP64 performance of 61.3 teraflops and on the matrix units, FP64 is twice that at 122.6 teraflops.
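For those who want to check those per-device numbers, here is a quick back-of-the-envelope sketch in Python – ours, not AMD’s – that multiplies the compute unit counts out to the peak FP64 figures:

```python
# Peak FP64 math for one MI300A, multiplied out from the compute
# unit counts and the 2.1 GHz peak GPU clock cited above (a sketch).

CUS = 228            # GPU compute units across the six XCDs
SPS_PER_CU = 64      # streaming processors per compute unit
MATRIX_PER_CU = 4    # matrix cores per compute unit
GPU_CLOCK_GHZ = 2.1  # AMD's peak engine clock for the MI300A

sps = CUS * SPS_PER_CU               # 14,592 streaming processors
matrix_cores = CUS * MATRIX_PER_CU   # 912 matrix cores

# Each streaming processor can retire one FP64 fused multiply-add
# (two flops) per clock, and the matrix cores double that rate.
vector_fp64_tf = sps * 2 * GPU_CLOCK_GHZ / 1e3   # ~61.3 teraflops
matrix_fp64_tf = 2 * vector_fp64_tf              # ~122.6 teraflops

print(f"{sps:,} streaming processors, {matrix_cores} matrix cores")
print(f"FP64 vector: {vector_fp64_tf:.1f} TF, matrix: {matrix_fp64_tf:.1f} TF")
```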
Each El Capitan node has 250.8 teraflops of peak FP64 performance, and when you lash all of those nodes together, you get 2,792.9 petaflops of aggregate FP64 oomph with 5.475 PB of HBM3 memory front-ending it. There are four I/O dies underneath the CPU and GPU compute chiplets, which glue these elements together and to the HBM3 memory; these chiplets are etched in 6 nanometer processes from TSMC.
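Rolling that up to the full system is straightforward, although it is worth noting that 250.8 teraflops per node works out to 62.7 teraflops per APU, a touch above the 61.3 teraflops that falls out of the 2.1 GHz spec clock – the Top500 accounting appears to assume a slightly higher rate:

```python
# Aggregate FP64 math for El Capitan, using the node counts and the
# per-node peak cited above (again, just the back of our envelope).

NODES = 11_136
APUS_PER_NODE = 4
NODE_PEAK_TF = 250.8    # per-node FP64 peak cited above

apus = NODES * APUS_PER_NODE                  # 44,544 MI300A devices
system_peak_pf = NODES * NODE_PEAK_TF / 1e3   # ~2,792.9 petaflops
per_apu_tf = NODE_PEAK_TF / APUS_PER_NODE     # 62.7 TF, versus the 61.3 TF
                                              # implied by the spec clock

print(f"{apus:,} APUs, {system_peak_pf:,.1f} PF peak, {per_apu_tf:.1f} TF per APU")
```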
It is interesting to us that the six GPU compute dies (XCDs in the AMD lingo) on the MI300A package are balanced against three eight-core CPU dies (CCDs) in the same socket. The “Frontier” supercomputer at Oak Ridge, the older sibling to El Capitan, had a similar pairing: the eight CPU chiplets on its custom “Trento” processor (one per node) matched one-to-one against the eight GPU chiplets across its quad of discrete, dual-chiplet “Aldebaran” MI250X GPUs. This tight packaging of CPU chiplets against GPU chiplets has been maintained across many generations of Cray supercomputers, and this is probably not an accident. In a sense, the MI300A is a 24-core X86 server cross-coupled to a six-chiplet GPU, all sharing the same memory.
Here is a summary table that shows the feeds and speeds of the El Capitan system, its “Tuolumne” and “rzAdams” chips off the El Capitan block at Lawrence Livermore, and the “El Dorado” system at Sandia National Laboratories:
Here is what the schematic of the El Capitan server node looks like:
As you can see, there are four Infinity Fabric x16 ports, which have 128 GB/sec of aggregate bandwidth, linking the four MI300A devices to each other in a memory coherent fashion.
There are another four ports, one per APU, coming off the MI300As that can be configured as either a PCI-Express 5.0 x16 slot or an Infinity Fabric x16 slot, and in this case they are set up as the former to allow for the plugging in of the Slingshot 11 network interface cards that actually link the APUs to each other across the system over the Slingshot 11 fabric.
A funny final thought on the El Capitan system, which technically has a peak of 2,746.38 petaflops on the portion of the machine that was used to run the High Performance Linpack benchmark that ranks supercomputers. (That portion of the machine had 43,808 APUs activated out of a total of 44,544 in the physical machine, or 98.3 percent of its capacity.) The 46 petaflops at the tail end of the rated performance – the third and fourth significant digits – is bigger than all but 34 of the machines on the November 2024 Top500 list. The rounding digits that are thrown away when you say “2.7 exaflops” are almost the same size as the “MareNostrum 5” supercomputer at the Barcelona Supercomputing Center.
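For the curious, the arithmetic on that Linpack slice looks like this:

```python
# The Linpack-run slice of El Capitan versus the whole machine, using
# the November 2024 Top500 figures quoted above (a sketch).

TOTAL_APUS = 44_544
HPL_APUS = 43_808
RPEAK_HPL_PF = 2_746.38   # peak of the portion that ran Linpack

fraction = HPL_APUS / TOTAL_APUS   # ~98.3 percent of the machine
tail_pf = RPEAK_HPL_PF - 2_700     # the ~46 PF of "rounding digits" that
                                   # vanish when you just say 2.7 exaflops

print(f"APUs active for HPL: {fraction:.1%}, tail: {tail_pf:.2f} PF")
```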
There is another 1.65 percent performance increase coming to El Capitan if Lawrence Livermore gets HPL running across all of the APUs in the system, and we think improvements to the interplay of compute, memory, and interconnect could boost it another 5 percent or so. If Lawrence Livermore can push that software and network tuning performance by 7.5 percent, then the peak HPL capacity of the machine will break through 3 exaflops, and we hope the lab can get there just because it is fun. That would be double the original expected performance of El Capitan five years ago when the project started – and on time and on budget to boot.
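Spelled out, with the caveat that this is our speculative stacking of those uplifts and not anything Lawrence Livermore has promised:

```python
# A speculative path past 3 exaflops, stacking the uplifts described
# above (nothing official, just the back of our envelope).

FULL_RPEAK_PF = 2_792.9   # all 44,544 APUs lit up, the ~1.65 percent uplift
TUNING_UPLIFT = 1.075     # the hoped-for ~7.5 percent tuning gain

projected_pf = FULL_RPEAK_PF * TUNING_UPLIFT
print(f"projected: {projected_pf:,.1f} PF")   # ~3,002 PF, just over 3 exaflops
```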
As a point of clarification, the “rabbit” modules are fully integrated into the existing 87 EX compute cabinets in El Cap. There are no additional racks dedicated solely to the rabbits.
That makes El Cap doubly cool IMHO … first for its direct liquid-cooling, and second for its top-hat shaped cabinets in which rabbits are fully integrated, for NVM-Express magic wand extraction and storage, with nothing up the sleeves and long coat tail distributions of the more typical data prestidigitations (or so I’m told)! 8^p
The great honor of the totally functional, enormous El Capitan HPC machine is entirely to the credit of Hewlett Packard Enterprise, a confident high tech corporation!!!
El Capitan will surely do well modeling limited-yield nuclear weapons. Really hoping the other side is in agreement on minimizing yield as much as possible. Russia and the US once tested and even deployed ICBMs with yields in the tens of megatons. At least the US tries to keep weapons below 2 Mt – not sure about Russia and China. The bigger ones aren’t suitable for MIRV anyway.
I love this integrated aspect of El Capitan’s MI300As, where CPUs and GPUs are packaged together. The improved communications (and arch, I guess), relative to Frontier, seem to give El Cap a 33% better Rpeak and (currently) 29% better Rmax, with just 22% more cores and 20% more juice. That’s clearly going in the right direction!
I wonder what would happen if the machine was run above 1.8 GHz, at something closer to the 2.1 GHz max of the GPUs, and 3.7 GHz max of the CPUs ( https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html )?
In the 2007 era Barcelona Supercomputing Center had an IBM super located in an old Catholic church. The racks were in a big glass enclosure where the pews were. The photos of it were quite beautiful. Sadly I did not get to visit it.
MN5 has two partitions: a 45 PF General Purpose partition and a 230 PF Accelerated partition. This article seems to leave out the Accelerated partition when it compares MN5 to the thrown-away rounding digits.
https://www.bsc.es/marenostrum/marenostrum-5
Fair enough. I was just giving people a frame of reference.