So, You Think You Can Design A 20 Exaflops Supercomputer?

UPDATED* Perhaps Janet Jackson should be the official spokesperson of the supercomputing industry.

The US Department of Energy has a single 2 exaflops system up and running – well, most of it anyway – and that of course is the “Frontier” system at Oak Ridge National Laboratory. Two more are slated for delivery: the “Aurora” system at Argonne National Laboratory, supposedly coming sometime this year, and the “El Capitan” system at Lawrence Livermore National Laboratory, which is due next year. It took a lot of money and sweat to get these machines into the field – in Intel’s case, the sweat to money ratio has been pretty high, given the four-year delay and the massive architectural changes involved in the latest and final incarnation of Aurora.

This week, the DOE put out a request for information for advanced computing systems, with Oak Ridge riding point, to get the vendor community to submit its thoughts on the next generation of supercomputers. The DOE expects to install these machines sometime between 2025 and 2030, and it expects them to deliver somewhere between 5X and 10X the performance on the real-world scientific problems it tackles today – or to have more oomph to take on more complex physical simulations or run them at higher resolution and fidelity. The RFI will give the DOE the base information from which it can begin evaluating the solution space for system development from 10 exaflops to 100 exaflops, figure out the kinds of options it has, and determine what research and development will be necessary to get at least two vendors building systems (if history is any guide).

The RFI is illustrative in many ways, and this particular paragraph sticks out:

“Include, if needed, high-level considerations of the balance between traditional HPC (FP64) needs and AI (BF16/FP16) needs; Include considerations, if needed, of architecture optimizations for large-scale AI training (100 trillion parameter models); domain specific architectures (e.g., for HPC+AI surrogates and hybrid classical–quantum deployments). Our rough estimate of targets includes traditional HPC (based upon past trends over the past 20 years) systems at the 10–20 FP64 exaflops level and beyond in the 2025+ timeframe and 100+ FP64 exaflops and beyond in the 2030+ timeframe through hardware and software acceleration mechanisms. This is roughly 8 times more than 2022 systems in 2026 and 64 times more in 2030. For AI applications, we would be interested in BF16/FP16 performance projections, based on current architectures, and would expect additional factors of 8 to 16 times or beyond the FP64 rates for lower precision.”

Elsewhere in the RFI, the DOE says the machines have to fit somewhere in a power envelope of between 20 megawatts and 60 megawatts – which probably means you design like crazy for 50 megawatts and it comes pretty close to 60 megawatts.

If you are cynical like that – and sometimes you have to be when facing down the slowing of Moore’s Law – then you can already get 2X performance with today’s technology just by scaling the power. Frontier weighs in at 29 megawatts when fully burdened, so the first 4 exaflops is easy. Just double the size of the system and count on software engineers to figure out how to scale the code.

If you want to build a 10 exaflops to 20 exaflops system in the same power envelope and within the same $500 million budget as Frontier – which the RFI from the DOE gently suggests is a good idea – then you have a real engineering task ahead of you. And frankly, that should be the goal for advanced supercomputing systems: do a lot more with the same money and power. Enterprise IT has to do more with less all the time, while HPC has to try to rein in the power and budget. This may not be possible, of course, given the limits of semiconductor and system manufacturing.

The money is as big of a problem here as are Moore’s Law issues and coping with sheer scale, so let’s talk about money here for a second.

We had a frank discussion about the money behind exascale and beyond recently with Jeff Nichols, who is spending his last day today (June 30) as associate laboratory director for Computing and Computational Sciences at Oak Ridge, and we did some more cost projections for capability class machines when covering the “Cactus” and “Dogwood” systems that the National Oceanic and Atmospheric Administration in the United States just turned on this week to run models at the National Weather Service for weather forecasts. The 3X boost in performance that NOAA now has is great, but as we pointed out, NOAA needs something more like a 3,300X increase in performance to move from 13 kilometer resolution forecasts down to 1 kilometer – and even more to get below that 1 kilometer threshold, where you can actually simulate individual clouds. And that would be a serious engineering challenge – and something within the scope of the RFI that Oak Ridge just put out, by the way. Probably somewhere around 9.3 exaflops to reach 1 kilometer resolution and maybe 11.5 exaflops to reach 800 meters, which is 4,096X the compute to increase the resolution by a factor of 16X.
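To put that resolution arithmetic in one place, here is a minimal Python sketch of the cubic grid scaling implied by those numbers. The exponent of 3 and the helper function are our own illustration, back-solved from the stated 16X resolution to 4,096X compute ratio – not anything NOAA or Oak Ridge has published – and real forecast upgrades also pay for shorter time steps and more vertical levels.

    # A minimal sketch of the cubic relationship described above: a 16X finer
    # grid needs roughly 16^3 = 4,096X the compute. The exponent of 3 is an
    # assumption taken from that stated ratio, so treat the output as a floor;
    # the ~3,300X estimate for the 13 km to 1 km jump folds in more than the
    # horizontal grid alone.

    def compute_multiplier(resolution_gain: float, exponent: float = 3.0) -> float:
        """Rough compute multiplier for refining the horizontal grid by resolution_gain."""
        return resolution_gain ** exponent

    print(compute_multiplier(16.0))  # 4096.0 -- the 13 kilometer to 800 meter case
    print(compute_multiplier(2.0))   # 8.0 -- every halving of grid spacing costs ~8X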

Money is a real issue as far as we are concerned. The rest is just engineering around the money. Let’s take the inflation adjusted budgets for the machines at Oak Ridge for the past decade against their peak FP64 performance:

  • The 1.75 petaflops “Jaguar” system cost $82 million per petaflops
  • The 17.6 petaflops “Titan” system cost $6.5 million per petaflops
  • The 200 petaflops “Summit” machine cost a little more than $1 million per petaflops
  • The new 2 exaflops “Frontier” machine cost $400,000 per petaflops

That is a factor of 16X improvement in the cost per petaflops between 2012 and 2022. Can the cost per petaflops be driven down by another factor of 16X by 2032? Can it really go down to $25,000 per petaflops by 2032, which implies somewhere around $50,000 per petaflops by 2027, halfway between then and now? That would be $500 million for a 10 exaflops machine, based on the accelerated architectures outlined above, in 2027 and $250 million for one in 2032. That also implies $2.5 billion for a 100 exaflops machine in 2032. And that implies a GPU accelerated architecture. You can forget about all-CPU machines unless CPUs start looking a lot more like GPUs – which is an option, as the A64FX processor from Fujitsu shows. But still, an all-CPU machine like the “Fugaku” system at RIKEN Lab in Japan cost $980 million to deliver 537.2 petaflops peak, and that is $1.82 million per petaflops. That is 4.6X more expensive per peak flops than Frontier. To be fair, Fugaku, like the K system before it at RIKEN, is an integrated design that performs well and is more computationally efficient than hybrid CPU-GPU designs. But Japan pays heavily for that.
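Here is a quick sketch of that budget arithmetic, using the round cost-per-petaflops figures above ($50,000 in 2027 and $25,000 in 2032) as assumptions. Nothing in it is a vendor quote; the lookup table and helper function are just our way of multiplying out the numbers.

    # A quick sketch of the budget math above, using assumed dollars per peak
    # FP64 petaflops in each year. Round numbers for illustration only.

    COST_PER_PETAFLOPS = {2022: 400_000, 2027: 50_000, 2032: 25_000}  # dollars

    def system_cost(exaflops: float, year: int) -> float:
        """Dollars for a machine with the given peak FP64 exaflops in the given year."""
        return exaflops * 1_000 * COST_PER_PETAFLOPS[year]

    print(f"10 EF in 2027:  ${system_cost(10, 2027) / 1e6:,.0f} million")   # ~$500 million
    print(f"10 EF in 2032:  ${system_cost(10, 2032) / 1e6:,.0f} million")   # ~$250 million
    print(f"100 EF in 2032: ${system_cost(100, 2032) / 1e9:,.1f} billion")  # ~$2.5 billion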

You can see why everyone wants to figure out how to use AI to stretch physics simulations and to use AI’s mixed precision engines to goose HPC performance, and you can see why the big machines are, for the most part, using some sort of accelerator.

Iterative solvers that use mixed precision, like the one in the HPL-AI benchmark, boost the effective performance of the machine by anywhere from 3X to 5X as well. That’s not using AI to boost HPC, and this is an important distinction that people sometimes miss. (It would have been helpful if this benchmark had been called High Performance Linpack – Mixed Precision, or HPL-MP, instead of HPL-AI, because the very name gives people the wrong impression of what is going on.)

Anyway. Back in November 2019, when Oak Ridge first used the iterative solver on the “Summit” supercomputer, it took all of Summit and got 148.6 petaflops using the standard HPL test in FP64 floating point mode on its vector engines. And with the iterative solver, which used a variety of mixed precision math to converge to the same answer, it was able to get an effective rate of 450 petaflops. (In other words, it got the answer 3X faster; a run using only FP64 math would have taken 3X as long to do the same work.)

A lot of high-end machines have been tested using the HPL-AI benchmark and its iterative solvers, which have been refined since that time. On the June HPL-AI list, the refined solvers are able to get a 9.5X speedup on Summit, making it behave like a 1.41 exaflops machine. (This is what my dad used to call getting 10 pounds of manure in a 1 pound bag. He did not say “manure” even once in his life. . . .) On Frontier, which is a completely different architecture, a dominant slice of the machine is rated at 1.6 exaflops peak, 1.1 exaflops on HPL, and 6.86 exaflops on HPL-AI, and there is no reason to believe the effective flops cannot eventually be boosted by around 10X there, too, as has happened on machines with very different architectures.
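For what it is worth, the speedup factors we just cited fall straight out of dividing the published HPL-AI numbers by the HPL numbers; here is a small Python sketch doing exactly that, with the figures taken from the text above rather than pulled fresh from the lists.

    # Back-of-the-envelope speedups from the HPL and HPL-AI figures quoted
    # above. The "effective" petaflops simply express how much faster the
    # mixed precision iterative solver reached an answer of FP64 accuracy.

    results = {
        # system: (HPL petaflops, HPL-AI effective petaflops)
        "Summit, Nov 2019":   (148.6, 450.0),
        "Summit, Jun 2022":   (148.6, 1_410.0),
        "Frontier, Jun 2022": (1_100.0, 6_860.0),
    }

    for system, (hpl, hpl_ai) in results.items():
        print(f"{system}: {hpl_ai / hpl:.1f}X effective speedup")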

The question is, can this iterative, mixed precision solver approach be used in real-world HPC applications to the same effect? And the answer is simple: Yes, because it has to.

The next question is: Will we “cheat” what we call 20 exaflops by counting the effective performance of using iterative solvers at the heart of simulations? We don’t think you can, because different applications will be more or less amenable to this technique. If this could be done, applications running on Frontier would already be close to 10 exaflops of effective performance. We could go retire, like Nichols. (Who did more than his fair share of conjuring massive machines into existence. Have a great retirement, Jeff.)

Either way, we can’t just be thinking about the actual and effective FP64 rates on these machines. The mic drop in that paragraph from the RFI above is the need to train AI models with 100 trillion parameters.

Right now, the GPT-3 model with 3 billion parameters is not particularly useful, and more and more AI shops are training the GPT-3 model with 175 billion parameters. That is according to Rodrigo Liang, chief executive officer at SambaNova Systems, who we just talked to yesterday for another story we are working on. The hyperscalers are trying to crack 1 trillion parameters, and 100X more than that sounds about as insane as it is probably necessary, given what HPC centers are trying to do over the next decade.

The “Aldebaran” Instinct MI250X GPU accelerator used in Frontier does not have support for FP8 floating point precision, so it cannot boost the parameter count and AI training throughput by dropping the resolution on the training data. Nvidia has FP8 support in the “Hopper” H100 GPU accelerator, and AMD will have it in the un-codenamed AMD MI300A GPU accelerator used in the El Capitan supercomputer. This helps. There may be a way to push training down to FP4, also boosting the effective throughput by 2X for certain kinds of training. But we think HPC centers in particular want to do training on high resolution data, not low resolution data. So anything shy of FP16 is probably not all that useful.

Here is what we are up against with this 100 trillion parameter AI model. Researchers at Stanford University, Microsoft, and Nvidia showed in a paper last year that they could train a natural language processing (NLP) model with 1 trillion parameters on a cluster of 3,072 Nvidia A100 GPUs running at 52 percent of its peak theoretical performance of 965.4 petaflops at FP16 precision, which works out to 241.3 petaflops at FP64 precision. If things are linear, then to do 100 trillion parameters at FP16 precision would require a machine with 24.1 exaflops of oomph. And if for some reason the HPC centers want their data to stay at FP64 precision – and there is a good chance that they will in many cases – then we are talking about a machine with 96.5 exaflops. That would be 307,200 A100 GPUs or 187,200 H100 GPUs, and if the data stayed in FP64 format, you would need 1.23 million A100 GPUs or 748,800 H100 GPUs. We realize that making a parameter comparison between NLP models and HPC models (which will probably be more like visual processing than anything else) is dubious, but we wanted to get a sense of the scale that we might be talking about here for 100 trillion parameters.
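To show where those exaflops and A100 counts come from, here is a hedged sketch of the linear extrapolation. The two assumptions, flagged in the comments, are ours: that training compute scales linearly with parameter count (optimistic for real models) and that FP16 throughput is 4X the FP64 rate, which is the ratio between the 965.4 petaflops and 241.3 petaflops figures above.

    # A sketch of the linear scaling argument above. Assumptions: compute
    # scales linearly with parameter count, and FP16 throughput is 4X the
    # FP64 rate (the ratio between the 965.4 PF and 241.3 PF figures).

    BASE_PARAMS = 1e12          # parameters in the Stanford/Microsoft/Nvidia run
    BASE_GPUS = 3_072           # A100s used in that run
    BASE_PEAK_FP16_PF = 965.4   # peak FP16 petaflops of that cluster
    FP16_TO_FP64 = 4            # assumed FP16-to-FP64 throughput ratio

    scale = 100e12 / BASE_PARAMS  # the jump from 1 trillion to 100 trillion parameters

    fp16_ef = BASE_PEAK_FP16_PF * scale / FP16_TO_FP64 / 1_000  # FP64-equivalent rating
    fp64_ef = BASE_PEAK_FP16_PF * scale / 1_000

    print(f"Train at FP16: ~{fp16_ef:.1f} EF machine, ~{BASE_GPUS * scale:,.0f} A100-class GPUs")
    print(f"Train at FP64: ~{fp64_ef:.1f} EF machine, ~{BASE_GPUS * scale * FP16_TO_FP64:,.0f} A100-class GPUs")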

Even if you assume HPC centers could use the sparsity feature support on the Tensor Cores instead of the FP16 on the vector cores of an Nvidia GPU – which is not always possible because some matrices in HPC are dense and that is that – that would still be 11,232 GPUs for FP16 formats and 44,928 GPUs for FP64 formats. But HPC centers have to plan for dense matrices, too. And if you think that Nvidia can double the compute density of its devices in the next decade – which is reasonable given that it has done so before – the number of streaming multiprocessors (SMs) is going to go up – we would say go through the roof – even if the machine might only need 5,000 or 6,000 GPU accelerators for sparse data. You will still need somewhere around 93,600 GPUs for dense matrix processing, and the level of concurrency across those SMs will be on the order of 100 million as Nvidia adopts chiplet designs and pushes packaging. (Which we think Nvidia will do because its researchers have been working on it.) If 93,600 of those future GPUs cost around $20,000 apiece come 2032, they alone would run $1.87 billion – most of the hypothetical $2.5 billion budget we talked about above.
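Multiplying that GPU bill out against the assumed price tag – the figures are the ones above, only the variable names are ours:

    # The dense-matrix GPU bill above, and its share of the hypothetical
    # $2.5 billion budget for a 100 exaflops machine in 2032.
    dense_gpus = 93_600    # GPUs assumed for dense matrix work circa 2032
    price_each = 20_000    # assumed dollars per future GPU accelerator
    budget = 2.5e9         # the hypothetical 2032 budget from earlier

    gpu_bill = dense_gpus * price_each
    print(f"${gpu_bill / 1e9:.2f} billion, or {gpu_bill / budget:.0%} of the budget")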

AMD will be eager to keep winning these deals, of course, and with Nvidia not exactly being aggressive in HPC because it owns the AI market, AMD won’t have to worry about Nvidia too much. Intel is another story, and it will be perfectly happy to lose money on HPC deals to show its prowess. Aurora has not been a good story for Intel’s HPC aspirations, on many fronts.

We can’t wait to see how vendors respond – but this will not be data that is shared with the public. These RFIs and their RFP follow-ons are Fight Club.

Here is one other interesting bit in the RFI coming out of Oak Ridge: The DOE wants to consider modular approaches to HPC systems, which can be upgraded over time, rather than dropping in a machine every four or five years. (It used to be three to four years, but that is going to stretch because of the enormous cost of these future machines.)

“We also wish to explore the development of an approach that moves away from monolithic acquisitions toward a model for enabling more rapid upgrade cycles of deployed systems, to enable faster innovation on hardware and software,” the DOE RFI states. “One possible strategy would include increased reuse of existing infrastructure so that the upgrades are modular. A goal would be to reimagine systems architecture and an efficient acquisition process that allows continuous injection of technological advances to a facility (e.g., every 12–24 months rather than every 4–5 years). Understanding the tradeoffs of these approaches is one goal of this RFI, and we invite responses to include perceived benefits and/or disadvantages of this modular upgrade approach.”

We think this means disaggregated and composable infrastructure, so trays of CPUs, GPUs, memory, and storage can all be swapped in and out as faster and more energy efficient kickers become available. But upgradeability, while a nice dream, may not be particularly practical for capability class supercomputers.

First, swapping out components means spending all of that component money again, which is why we did not see the K supercomputer get upgraded at RIKEN Lab in Japan as Fujitsu rolled out generation after generation of Sparc64 processors. This is why we probably will not see new generations of A64FX processors, unless something happens in the HPC market in Japan that bucks history. Ditto for any machine based on Nvidia, AMD, or Intel GPUs. Touching a machine that is running is risky enough, but having to pay for it again and again is almost certainly not going to be feasible – unless the governments of the world, and the DOE in particular in this RFI case, have a plan to put in a fifth of a machine every five years, all with mixed components. But that causes its own problems because you cannot get machines to work efficiently in lockstep when their parts finish bits of work at different rates.

This is going to be a lot harder than slapping together 100,000 GPU accelerators with 100 million SMs, with 1.6 Tb/sec Ethernet and InfiniBand interconnects, with CXL 5.0 links linking together CPU hosts and maybe a dozen GPUs into a node using 1 TB/sec PCI-Express 8.0 x16 links. But that’s probably where we would start and work our way backwards from there.

Update: We slipped a decimal point in the cost for machines in 2027 and 2032 due to a spreadsheet error. Our thanks to Jack Dongarra for catching it.

4 Comments

  1. To me this article makes it seem extra likely that Artificial General Cognitive Intelligence can eventually ‘materialise’; unless this development within Sci-Tech is put a stop to by our uniquely EVASIVE species’s in parallel operating foolishness (or, IOW, our staunch defence of our evolved neurotically motivated stupidities) and/or by some extreme solar-flare from our Sun, or by some much less likely other astrophysical impact.
    P.S. I bet that a general purpose AGI-capable Quantum Computer is an impossibility.

  2. The semiconductor industry has been telling us for a decade now that ALUs are basically free, it’s moving data around that costs joules. So in the absence (or slowing) of Moore’s law, how do you get 10 times more flops? Do you sacrifice data mobility? Do you bring the flops closer to where the data already is? Does that work for hard problems, or just for the embarrassingly parallel ones? Can you make linpack scale? Can you make anything other than linpack scale? Or do we just add a ton more transistors and lower the clock speed to the point where you fit in the power budget?

    • I think there is probably a reason that the human brain operates at around 80 hertz and only burns 10 watts or so, right?

      I also think that memory capacity and memory bandwidth per unit of compute need to be realigned, and that memory speed and compute speed need to be in better synch.

      I keep thinking of a giant stack of vertically connected wafers, something like Cerebras does but in 3D, with all kinds of coolant flowing through it and a completely 3D set of pins on the outside and pipes coming out of it like a church organ. Very slow, but very wide and with a huge amount of compute. And reconfigurable for at least some of its functions.
