Investment in supercomputing and related HPC technologies is not just a sign of how much we are willing to bet on the future with someone else’s money, but of how much we believe in it ourselves and, more importantly, how much we believe in the core idea that we can predict and therefore shape the future of the world.
After watching the HPC simulation and modeling market for three and a half decades now, and watching the rise of the hyperscalers and generative AI based on the personal information culled from billions of people and untold zillions of interactions between us and our retailers, our governments, our schools, and every other kind of imaginable telemetry, we think, as we have thought for many years, that the world is not taking HPC as seriously as it needs to if we are to solve some pretty big problems. Not even close.
One of the core reasons for this, of course, is that it is very hard to make money – meaning profits – from HPC, even though it has become extremely expensive to build capability-class supercomputers. One need only look at the financials of IBM, Cray, SGI, and Hewlett Packard Enterprise (which ate SGI and Cray) to see that, over the long haul, the supercomputing business doesn’t really bring much cash to the bottom line.
Part of the reason for this is that the biggest HPC users are really and truly research and development partners for the compute engine, networking, and storage suppliers that work with supercomputer builders to put massive machines into the field. What profits there are usually end up in the pockets of the CPU and GPU makers and, to a lesser extent, the interconnect and storage makers. The profits for Nvidia GPUs are absolutely enormous these days, but that is driven more by too much AI customer demand chasing a limited supply of Nvidia high-end GPUs and interconnects than by anything HPC is doing.
Another reason, we think, why HPC isn’t a profitable business is that it deals in hard-to-quantify benefits and is not driving a worldwide advertising, search engine, and cloud computing machine that is at the heart of the Internet and therefore the habits of users and potential product buyers. And thus, HPC is a bit like machine learning back in the 1980s, when all of the groundwork was laid for success in the 2010s and beyond. Back then, AI researchers had the right algorithms for convolutional neural networks, but they did not have a lot of labeled or unlabeled data on which to train networks and they certainly did not have enough parallel computing power to build the networks in a reasonable timeframe. Similarly, the best HPC systems in the world can only simulate a crude approximation of anything for a reasonably long term, or a high-fidelity approximation for a very short term, measured in picoseconds to seconds depending on what it is. We just don’t have enough compute to really simulate at the necessary scale.
Nvidia created massively parallel, datacenter-class GPU compute engines because it wanted to accelerate HPC simulation and modeling workloads that were both compute and bandwidth constrained a decade and a half ago. And it was the advent of the general purpose GPU compute engine that finally gave AI researchers at the hyperscalers and in academia a platform on which to run their AI algorithms, which have grown in scale, complexity, and usefulness at an exponential rate for a decade and a half.
But for all the talk about AI, this is frankly easy. And it is a parlor game compared to the grand challenges of HPC simulation and modeling. As we start 2024 and the AI hyperhype cycle, this needs to be said.
It is hard enough to get the money together to do proper exascale computing at high floating point resolution, which presents some incredibly difficult engineering and budgetary challenges. It is hard to wring more performance out of our compute engines, and it is harder still to face the fact that the cost of capability-class supercomputers has been on the rise for decades. A top-end HPC machine used to cost $50 million, then it was $100 million, then $200 million, and more recently it is more like $500 million. A high-end AI system with more than 20,000 GPUs costs on the order of $1 billion, and it costs about 2.5X that to rent one over a four-year span. HPC can ride on the coattails of AI, and now an expensive machine can be made to do both of these very different kinds of work, which makes the budget stretch further. But one could argue that you get a sub-optimal design by supporting these two very different kinds of work on the same machine.
Here in 2024, we care less about that than we do about solving grand challenge problems. Here are just a few of them off the top of our hotheads:
- We want to cure cancer. Or rather, cancers. We wanted cancer to be cured for our grandparents, and then our parents, and then for ourselves, and while much progress has been made for which we are truly grateful, now we are hoping we can do it for our children or our grandchildren. Curing cancer is far more important than unemploying billions of people with hyperscale AI engines running generative AI, even if it doesn’t make money – and in fact it may make medical care less medieval and less expensive in the longest of runs.
- We want to actually get hourly and daily weather forecasts at the sub-kilometer grid scale that allows for several step functions in forecast accuracy, which could take on the order of 10 exaflops to accomplish.
- We want Earth-scale climate change models so we can stop arguing about what is and is not happening and have those models run against past weather to prove they are accurate and run forward to give us the best case scenario for what the climate will do. We need to all see and believe whatever the probable scenarios are and start working to manage the climate of our planet. Whether or not climate change is natural is not an argument worth having. Talking about the enormous engineering opportunity and job creation plan to make Earth better suited to current lifeforms is the point.
- We want nuclear fusion power yesterday. As we have joked about a couple of times recently, we want to build a time machine to send that nuclear fusion power and advanced chip technology back to 1969 so we have the electricity to build exascale computers in the past so we can fix all of these problems earlier. OK, we are mostly joking about the time machine. . . but we do need nuclear fusion. And we need large scale simulations and the help of AI to pare down the domain space of how we might create portable baby suns burning tritium and deuterium called Mr Fusion that are made by Newell Brands and hawked on the Internet by Neil deGrasse Tyson.
- We can throw in full-scale nuclear bomb simulation and detonation simulation to help governments convince themselves they ought to pay for it. We were supposed to build impulse engines and warp drives, but we do have to maintain our nuclear stockpiles for détente.
The problems that we want solved with simulations and models are more like zettascale or yottascale problems. Some of them may even be on the order of hellascale or brontoscale, for all we know, and those are a billion times exascale, which is hard to even think about, particularly with exascale being so tough and zettascale sounding absurd right now. Yes, AI can help accelerate the models and simulations in a number of ways, including extrapolating between timesteps in the simulation to lower the compute capacity needed for any given simulation – generating fake but reasonably good data, in effect – as well as figuring out what to simulate and how.
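To make that timestep trick a bit more concrete, here is a minimal sketch under entirely toy assumptions of our own – a 1-D heat equation, with plain linear interpolation standing in for a learned surrogate – of how running the expensive solver only every other step and letting something cheap fill in the gaps cuts the solver work in half:

```python
import numpy as np

# Toy setup: 1-D heat equation on a unit interval with a Gaussian temperature bump.
nx = 200
dx = 1.0 / nx
dt = 5.0e-6
alpha = 1.0
x = np.linspace(0.0, 1.0, nx)
u0 = np.exp(-100.0 * (x - 0.5) ** 2)

def solver_step(u, step):
    """One explicit finite-difference step of the 1-D heat equation."""
    un = u.copy()
    un[1:-1] += alpha * step / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
    return un

# Reference run: the expensive solver computes every one of 1,000 timesteps.
u_ref = u0.copy()
for _ in range(1000):
    u_ref = solver_step(u_ref, dt)

# "Accelerated" run: the solver takes 500 double-length steps and a cheap
# surrogate (here, linear interpolation) fills in the in-between states,
# so the solver does half the work for the same number of output states.
u = u0.copy()
states = [u0.copy()]
for _ in range(500):
    u_next = solver_step(u, 2.0 * dt)      # expensive step, stride 2*dt
    states.append(0.5 * (u + u_next))      # cheap interpolated state at +dt
    states.append(u_next)
    u = u_next

print("max difference vs. full solve:", float(np.abs(states[-1] - u_ref).max()))
```

A production code would use a trained model rather than interpolation and would check the surrogate against occasional full solves, but the budget math is the same: fewer expensive solver steps for a given amount of simulated time.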
And the lower precision pushed by AI workloads might help as well in getting algorithms to converge without relying solely on FP32 or FP64 floating point, as the HPL-MxP mixed precision variant of the glatt kosher High Performance LINPACK benchmark shows so well, boosting the effective performance of machines by an order of magnitude across many architectures. The plain vanilla HPL, of course, has been used to gauge the sustained performance of 64-bit machines for decades – since we have had 64-bit machines, really. If the mixed-precision iterative refinement that underpins HPL-MxP works to turbocharge HPL, presumably the same techniques can work on real-world workloads. But that is only a step function, and it is a one trick pony, even if it does increase the bang for the buck of supercomputers by an order of magnitude.
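For readers who have not bumped into it, the trick underneath HPL-MxP is iterative refinement: do the expensive O(n^3) factorization work in cheap low precision, then polish the answer with a handful of cheap O(n^2) residual corrections in full precision. A minimal NumPy sketch of the idea – our own toy, not the benchmark code – looks like this:

```python
# Toy sketch of mixed-precision iterative refinement, the idea behind HPL-MxP:
# solve in float32, then correct the residual in float64 until it converges.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)
A32 = A.astype(np.float32)

# "Expensive" low-precision solve gives a rough starting answer.
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

# Cheap full-precision refinement steps claw back float64 accuracy.
for it in range(5):
    r = b - A @ x                                             # residual in float64
    dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    x += dx
    print(f"iteration {it}: residual norm = {np.linalg.norm(b - A @ x):.3e}")
```

In a real code the low-precision LU factors would be computed once and reused for every correction step (np.linalg.solve refactors each time, which keeps this sketch short but gives away the cost savings), and the same pattern is what lets low-precision tensor cores do most of the flops while FP64 keeps the answer honest.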
From where we sit, the HPC industry has a larger problem than arguing for budgets and finding someone willing to make no money building the most advanced computers in the world every three years.
We are what we believe we are, and the future each of us simulates within our own mind every day, with all kinds of input from the outside world including the predictions of others, is the one we pursue and cultivate, the one that we believe in. This is how people work. HPC is a way to have a shared simulation, a kind of time machine that can look forward and backward and help us all understand where we have been, where we are, and where we are going. If anything, we do not have anywhere near the capacity available to do HPC properly.
Our conviction lacks scale, and therefore, so does our supercomputing. It is that simple.
Supercomputing – let’s call it hypercomputing – is not a national or global phenomenon like sending two dozen people to the Moon and having a dozen of them go down to the surface to walk around. We all did that together, and we are old enough to remember the first time it happened and young enough to still be inspired by it.
We have to prioritize, and we need to do it now. There is no question that Earth needs cheaper electricity – and fully burdened costs here, please – as much as we need cheaper flops. We need that electricity for a lot more than hypercomputers with 1,000X or more the oomph of what we can put together today – for every joule of energy that was released from a carbon source since humanity started, it will take more than one joule to lock that carbon back up, for instance. We know that supercomputer performance is not going to scale as fast as power consumption, and has not for years. But hypercomputing is a special case where we should not care so much about the power that is burned if it solves the kinds of problems we mention above. If we solve such grand challenge problems, the power is worth it. And if we don’t shift to hypercomputing, maybe we can’t solve these problems and this is all just academic.
Either you believe HPC simulation and modeling can change the world – can save the world – or you don’t.
And if you believe that, as we do, then we have to convince others that we can solve such grand challenges and make all people the beneficiaries of the results. It seems only fair, since all people will be paying for it.
Which brings us to the next point. We want the governments of the world to spend less time supporting the portion of the semiconductor, server, networking, and storage industries that does HPC – where the players never make any profit on HPC systems – and more time building the machines that can do the job, the real hypercomputing job, and building enough of them to help us solve the immense problems that we have on this planet.
We don’t care if it costs $20 billion or $30 billion a year for the United States alone. We don’t care if it costs $100 billion a year globally. Get it done. Earth spent somewhere around $2.4 trillion on its combined militaries last year. We don’t have A Small Talent For War; we have a small talent for HPC. And now we have a huge appetite for generative AI, and on a dime, somewhere around $50 billion was spent last year on servers and storage to support it.
Let’s talk about money for a bit longer. In the United States, state and local governments spent $3.5 trillion in 2020 (the most current data I can find) and the Federal government spent $1.6 trillion in fiscal 2020. There is some double counting in there, as the Feds give the states some money and the states give local governments some money. Call it $4.5 trillion in net spending for the sake of argument. What is $20 billion or $30 billion a year against this?
Microsoft, Amazon Web Services, and Google spend more than that to sift through the digital residue of our lives (or help partners do it through their applications) and guess what we might need to buy or see next. . . . And by the way, they guess wrong a lot as far as I can tell from my personal experience. (Dummy, I already bought Nicole these kinds of dresses. . . .)
We don’t care if our taxes pay for this. In fact, we prefer that our taxes pay for this. Send us our share of the bill directly every month for the National AI and HPC Facility and we will pay it. Can it be any worse than our cable bill or our smartphone bill? Not really, as it turns out.
There are 130.18 million families in the United States right now and the average family spends $1,309 a year for cable and $1,368 a year for phone service. That is $170.4 billion on cable and $178.1 billion on phone. The National AI and HPC Facility bill would work out to $13 to $19 per family per month, a hell of a lot cheaper than the $109 average per month for cable and the $114 average per month for phone that Americans pay.
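That arithmetic is easy enough to check. Here is the back-of-the-envelope version in a few lines of Python, replaying only the figures quoted above:

```python
# Back-of-the-envelope check of the per-family arithmetic quoted above.
families = 130.18e6                              # US families
cable_per_year, phone_per_year = 1309.0, 1368.0  # average annual spend per family

print(f"cable: ${families * cable_per_year / 1e9:.1f}B/yr, ${cable_per_year / 12:.0f}/family/month")
print(f"phone: ${families * phone_per_year / 1e9:.1f}B/yr, ${phone_per_year / 12:.0f}/family/month")

for budget in (20e9, 30e9):                      # National AI and HPC Facility budget range
    print(f"${budget / 1e9:.0f}B/yr facility -> ${budget / families / 12:.2f} per family per month")
```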
We need to do something big and national – global as a collective of nations working together – that brings us all together and that actually accomplishes something big and national and global. Like NASA in the 1960s, may we never forget.
Maybe we can build one immense cloud-style real AI-HPC hypercomputer and put it in Lawrence, Kansas. Bury it deep, keep it safe. Sam and Dean can help.
And while we are thinking about it, why not use FPGAs instead of ASICs for a lot of this? Pay the cost in power and performance and gain flexibility and longevity. Having to buy two machines for $1 billion total over ten years is no better than building one $1 billion machine that is designed to be used for ten years. Better still, build a $500 million FPGA machine that is designed so it can be used for ten years. Then build ten of them for $5 billion or a hundred of them for $50 billion and solve a real grand challenge problem. Who the hell cares if a supercomputer can play Go? It won’t matter to us when we are dead of cancer. Solve a real problem. And then solve the next one. And the next one. . . .
The key thing here – and this is the key – is that having spent all of this money on the National AI and HPC Facility, the smartest minds can and must create the best code that has ever been written to actually solve the grand challenge problems. This is the hard part, this is where generative AI might really help. All we need is a vision and a budgetary commitment, and a belief that we can solve problems.
We believe. Do you?
Chief,
I think these rates are extremely expensive. Somewhere in Eastern Europe we have them like this – monthly utility bills with VAT, in USD:
- Internet (500 Mbit/s) and cable TV: $13.50
- Netflix: $11.00
- Mobile phone #1: $8.00
- Mobile phone #2: $8.00
- Mobile phone #3: $13.00
- Total: $53.50
“Maybe we can build one immense cloud-style real AI-HPC hypercomputer”
Another approach would be to design a dedicated, small computer, like an appliance, say the size of a microwave oven, and install it in the homes of those who want a $249 (USD) tax cut from the IRS per year. And if you take four computing consoles, then the value of the tax deduction is multiplied by four. Or by five.
And all these computing consoles would form a giant distributed computer. It probably wouldn’t have the performance of an HPC hypercomputer located in a dedicated site, but I think it’s an idea worth putting forward.
I have long joked that every grill in every fast food restaurant should be powered by a supercomputer. These could be home heating units, too!
And use Coca-Cooling systems to keep them from overheating.
All kidding aside, a government entity, as you described, could lease space in the datacenters of the large FAANGs on the US mainland and place a few racks in there to do distributed computing, Folding@home style.
A huge distributed HPC machine has some important advantages:
1. It can be designed and manufactured in a short time.
2. It can use parts from different suppliers.
3. Each computing unit can have a different hardware architecture (CPUs, GPUs, ASICs, FPGAs, memory, storage, etc.).
4. It can be upgraded in a continuous loop without shutting down the whole machine.
5. It can start small, with a small budget, and grow organically, trying to adapt to new requirements as they arise.
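To make that concrete, here is a tiny, purely hypothetical sketch of the Folding@home/BOINC pattern being described: independent work units sit in a shared queue and workers of very different speeds pull from it until it is empty, which is what makes mixed hardware and organic growth tolerable. Real grids add result validation, redundancy, checkpointing, and security on top of this.

```python
# Hypothetical sketch of a Folding@home/BOINC-style work pool: independent work
# units pulled from a shared queue by workers of very different speeds.
import queue
import threading
import time

work_units = queue.Queue()
for unit_id in range(20):
    work_units.put(unit_id)            # e.g., one parameter-sweep point each

results, results_lock = [], threading.Lock()

def worker(name, speed):
    """Pull work units until the queue is empty; 'speed' mimics heterogeneous hardware."""
    while True:
        try:
            unit = work_units.get_nowait()
        except queue.Empty:
            return
        time.sleep(0.01 / speed)       # stand-in for the actual computation
        with results_lock:
            results.append((unit, name))

threads = [threading.Thread(target=worker, args=(name, speed))
           for name, speed in [("old-laptop", 1.0), ("gaming-pc", 4.0), ("fpga-box", 8.0)]]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(results)} work units completed by {len(threads)} mismatched workers")
```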
My opinion is that supercomputing has already changed the world in significant ways. It’s just that, like most science, people don’t think about and don’t understand where all the changes came from and how. I certainly don’t, but the changes are astonishing.
On the other hand, it looks like the nuclear test ban treaty, which motivated the construction of a number of the largest supercomputers, may be about over, as it’s not really a treaty if only one side abides by it.
A nicely inspirational article for the start of the new year. Here’s to hoping we clear major milestones on the TNP list of grand challenge problems in 2024! It’ll be great to see us clear the 1 EF/s mark more decisively this year, with El Capitan, and Aurora, and Jupiter (maybe Venado too, not sure of its expected perf. though) – a slightly longer wait than expected since Frontier’s debut 18 months ago.
One thing to remember about simulating physical problems in 3-D is that each time we improve spatial resolution (uniformly) by a factor of 2, say in the x, y, and z directions, we need 2x2x2 = 8 times more nodes (simultaneous equations to solve), and possibly twice as many time steps as well (to maintain stability and accuracy constraints). So, improving resolution by a factor of 10x needs essentially 1,000 times the nodes and 10 times the time steps, or 10,000 times the computational power of what is currently available, possibly 10 ZettaFlops. We’ll want to be sure to invest in that goal, especially now that the ExaFlop is “more common” and better understood, including leveraging MxP and/or specialized computational stencils (https://www.nextplatform.com/2022/04/25/oil-and-gas-industry-to-get-its-own-stencil-tensor-accelerator/).
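Spelling that scaling out in a couple of lines, taking roughly 1 exaflops as today’s baseline:

```python
# Resolution scaling for a 3-D explicit simulation: refining the grid by a
# factor r means ~r^3 more nodes and ~r more time steps, so ~r^4 more compute.
baseline_flops = 1e18                       # ~1 exaflops, roughly today's top systems

for r in (2, 10):
    factor = r**3 * r
    print(f"{r}x finer grid -> {factor:,}x the compute -> ~{baseline_flops * factor:.0e} FLOPS")
```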
Very inspirational to at least one young former NASA physicist-turned-large scale A.I. developer…
100% agreed on the core argument: public investment in supercomputing (and the simulation capabilities that brings, for solving the world’s hardest problems) can and should be multiplied by orders of magnitude.
There is a very large (and literally “powerful”) community of scientists/engineers/… rooted in the tech startup world, which you may know of as HackerNews, and I think they will love to read this highly original, provocative, and visionary article. I’ve posted the article there; if you upvote it, then it’s more likely that hundreds of others will see it, and a very interesting discussion will follow: https://news.ycombinator.com/item?id=38907195
By the way, you can thank Google’s Discover algorithm for having recommended this article to me, I guess because it has indexed, on my end, that I now work closely with the University of Florida’s HiPerGator supercomputer.
Do we need to reinvigorate the use of BOINC (or a similar framework) to solve the real-world problems using everybody’s idle system at home?
Why not?
In response to a reinvigorated BOINC, it would be interesting if a disaggregated AI training algorithm insensitive to network bandwidth and latency could be developed that worked on a distributed grid of home computers.
Although I don’t have the insight to know why current training techniques need so much bandwidth and memory, it seems to me that it might be possible to train portions of a larger neural network independently. In fact, I recall attending a thesis defense that touched on such things last year.
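One family of techniques along those lines is local SGD (also known as federated averaging): each machine trains on its own data shard for many steps and parameters are only averaged occasionally, which slashes the number of communication rounds compared to exchanging gradients every step. Here is a toy sketch, with a linear model and synthetic data standing in for a real neural network:

```python
# Toy sketch of local SGD / federated averaging: workers train locally for many
# steps and parameters are averaged only once per round, so communication is
# rare. A linear model and synthetic data stand in for a real neural network.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0, 0.5])
n_workers, rounds, local_steps, lr = 4, 20, 50, 0.01

# Each "home computer" holds its own private shard of data.
shards = []
for _ in range(n_workers):
    X = rng.standard_normal((200, 3))
    y = X @ true_w + 0.1 * rng.standard_normal(200)
    shards.append((X, y))

w_global = np.zeros(3)
for _ in range(rounds):                        # one parameter exchange per round...
    local_ws = []
    for X, y in shards:
        w = w_global.copy()
        for _ in range(local_steps):           # ...but many cheap local steps
            grad = 2.0 * X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        local_ws.append(w)
    w_global = np.mean(local_ws, axis=0)       # the only network traffic

print("recovered weights:", np.round(w_global, 3))
print(f"communication rounds: {rounds} vs {rounds * local_steps} for per-step sync")
```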
Perhaps the quantum computer will arrive before the zetta computer, and until then the big cloud providers will lead the race in terms of flops and dollars.