Will HPC Be Eaten By Hyperscalers And Clouds?

Some of the most important luminaries in the HPC sector have spoken from on high, and their conclusions about the future of the HPC market are probably going to shock a lot of people.

In a paper called Reinventing High Performance Computing: Challenges and Opportunities, written by Jack Dongarra of the University of Tennessee and Oak Ridge National Laboratory – Jack, didn't you retire? – along with Dan Reed of the University of Utah and Dennis Gannon, formerly of Indiana University and Microsoft, we get a fascinating historical view of HPC systems and then some straight talk about how the HPC industry needs to collaborate more tightly with the hyperscalers and cloud builders for a lot of technical and economic reasons.

Many in the HPC market have no doubt been thinking along the same lines. It is in the zeitgeist, being transported on the cosmic Ethernet. (And sometimes InfiniBand, where low latency matters.) And by the way, we are not happy about any of this, as we imagine you are not either. We like the diversity of architectures, techniques, and technologies that the HPC market has developed over the years. But we also have to admit that the technology trickle-down effect – where advanced designs eventually make their way down into large enterprises and then everywhere else – did not happen at the speed, or to the extent, that we had hoped it would over the decades we have been watching this portion of the IT space.

As usual, the details of the scenario painted in this paper and the conclusions that the authors draw are many and insightful, and we agree wholeheartedly that there are tectonic forces at play in the upper echelons of computing. Frankly, we founded The Next Platform with this idea in mind and used the same language, and in recent years have also wondered how long the market for systems that are tuned specifically for HPC simulation and modeling would hold out against the scale of compute and investment by the hyperscalers and cloud builders of the world.

The paper's authors have a much better metaphor for contrasting large-scale HPC system development with hyperscale infrastructure investment, and that is to look at them like chemical reactions. HPC investments, especially for capability-class machines, are endothermic, meaning they require infusions of capital from governments and academia to cover the engineering costs of designing and producing advanced systems. But investments in large-scale machinery at the hyperscalers and cloud builders are exothermic, meaning they generate cash – and among the Magnificent Seven of Amazon, Microsoft, Google, Facebook, Alibaba, Baidu, and Tencent, enormous amounts of it. We would go so far as to say that the reaction among the hyperscalers and cloud builders is volcanic, which is exothermic with extreme attitude. Enough to melt rock and build mountains.

The geography of the IT sector has been utterly transformed by these seven continents of compute, and we all know it, and importantly, so does the HPC community that is trying to get to exascale and already contemplating 10 exaflops and even zettascale.

“Economies of scale first fueled commodity HPC clusters and attracted the interest of vendors as large-scale demonstrations of leading edge technology,” the authors write in the paper. “Today, the even larger economies of scale of cloud computing vendors has diminished the influence of high-performance computing on future chip and system designs. No longer do chip vendors look to HPC deployments of large clusters as flagship technology demonstrations that will drive larger market uptake.”

The list of truisms that Dongarra, Reed, and Gannon outline as they survey the landscape is unequivocal, and we quote:

There is no question that the HPC and hyperscaler/cloud camps have been somewhat allergic to each other over the past decade or two, although there has been some cross-pollination in recent years, with both people and technologies from the HPC sector being employed by the hyperscalers and cloud builders – mostly to attract HPC simulation and modeling workloads, but also because of the inherent benefits of technologies such as MPI or InfiniBand when it comes to driving the key machine learning workloads that have made the hyperscalers and cloud builders standard bearers for the New HPC. They didn't invent the ideas behind machine learning – the phone company did – but they did have the big data and the massive compute scale to perfect it, and they are also going to be the ones building the metaverse – or metaverses – which are really just humongous simulations, driven by the basic principles of physics and done in real time.
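
To make that cross-pollination concrete, here is a minimal sketch – our illustration, not anything from the paper – of the pattern that ties the two camps together: data-parallel machine learning training averages gradients across workers with an allreduce, the very collective that HPC codes have pushed over InfiniBand for decades. It assumes a Python environment with NumPy and mpi4py installed, and the array size is arbitrary.

```python
# Minimal sketch: the MPI allreduce that HPC codes have used for decades is
# the same collective pattern that data-parallel ML training relies on to
# average gradients across workers (what NCCL/Horovod-style libraries do).
# Assumes NumPy and mpi4py are installed. Run with: mpiexec -n 4 python demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Pretend each rank computed gradients on its own shard of the training data.
local_grad = np.random.default_rng(rank).standard_normal(1_000_000)

# Sum the gradients across all ranks, then divide to get the global average.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print(f"averaged gradients across {size} workers")
```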

What it comes down to is that standalone HPC in the national and academic labs takes money and has to constantly justify the architectures and funding for the machines that run their codes, and that the traditional HPC vendors – so many of them are gone now – could not generate enough revenue, much less profit, to stay in the game. HPC vendors were more of a public-private partnership than Wall Street ever wanted to think about or the vendors ever wanted to admit. And when they did make a profit, it was never sustainable – just like being a server OEM is getting close to not being sustainable due to the enormous buying power of the hyperscalers and cloud builders.

We will bet a Cray-1 supercomputer assembled from parts acquired on eBay that the hyperscalers and cloud builders will figure out how to make money on HPC, and they will do it by offering applications as a service, not just infrastructure. National and academic labs will partner there and get their share of the cloud budget pool, and in some cases where data sovereignty and security concerns are particularly high, the clouds will offer HPC outposts or whole dedicated datacenters, shared among the labs and securely walled off from other enterprise workloads. Moreover, the cloud makers will snap up the successful AI hardware vendors – or design their own AI accelerator chips, as AWS and Google do – and the HPC community will learn to port its routines to these devices as well as to CPUs, GPUs, and FPGAs. In the longest of runs, people will recode HPC algorithms to run on a Google TPU or an AWS Trainium. No, this will not be easy. But HPC will have to ride the coattails of AI, because otherwise it will diverge from the mainstream hardware path and cease to be an affordable endeavor.
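
What recoding an HPC algorithm for an AI accelerator can look like in practice: below is a toy sketch – ours, and purely illustrative – of a Jacobi-style 2D stencil sweep written in JAX, which compiles through XLA to CPU, GPU, or TPU back ends. Getting real production codes onto a TPU or a Trainium will be far harder than this, which is rather the point.

```python
# Toy sketch: a classic HPC kernel (a Jacobi-style 2D stencil sweep) written
# in JAX, which XLA can compile for CPU, GPU, or TPU back ends.
# Assumes the jax package is installed; grid size and sweep count are arbitrary.
import jax
import jax.numpy as jnp

@jax.jit
def jacobi_step(u):
    # Replace each interior point with the average of its four neighbors,
    # leaving the boundary values fixed.
    interior = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return u.at[1:-1, 1:-1].set(interior)

u = jnp.zeros((1024, 1024)).at[0, :].set(1.0)   # hot top edge as a boundary condition
for _ in range(100):
    u = jacobi_step(u)
print(float(u[1, 512]))                          # sample an interior value after 100 sweeps
```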

As they prognosticate about the future of HPC, Dongarra, Reed, and Gannon outline the following six maxims that should be used to guide its evolution:

Maxim One: Semiconductor constraints dictate new approaches. There are constraints from Moore's Law slowing and Dennard scaling stopping, but it is more than that. We have foundry capacity issues and geopolitical problems arising from chip manufacturing, as well as the high cost of building chip factories, and there will need to be standards for interconnecting chiplets to allow easy integration of diverse components.

Maxim Two: End-to-end hardware/software co-design is essential. This is a given for HPC, and chiplet interconnect standards will help here. But we would counter that the hyperscalers and cloud builders limit the diversity of their server designs to drive up volumes. So just like AI learned to run on HPC iron back in the late 2000s, HPC will have to learn to run on AI iron in the 2020s. And that AI iron will be located in the clouds. (A sketch of one way HPC codes adapt to AI iron follows the maxims below.)

Maxim Three: Prototyping at scale is required to test new ideas. We are not as optimistic as Dongarra, Reed, and Gannon that HPC-specific systems will be created – much less prototyped at scale – unless one of the clouds corners the market on specific HPC applications. Hyperscalers bend their software to fit cheaper iron, and they only create unique iron with homegrown compute engines when they feel they have no choice. They will adopt mainstream HPC/AI technologies every time, and HPC researchers are going to have to make do. In fact, that is largely what the HPC jobs of the future will be: making legacy codes run on new clouds.

Maxim Four: The space of leading edge HPC applications is far broader now than in the past. And, as they point out, it is broader because of the injection of AI software and sometimes hardware technologies.

Maxim Five: Cloud economics have changed the supply chain ecosystem. Agreed, wholeheartedly, and this changes everything. Even if cloud capacity costs 5X to 10X as much as running it on premises, the cloud builders have so much capacity that every national and academic lab could be pulling in the same direction as they modernize codes for cloud infrastructure – and where that infrastructure is sitting doesn't matter. What matters is changing from CapEx to OpEx. (A back-of-the-envelope version of that arithmetic also follows the maxims below.)

Maxim Six: The societal implications of technical issues really matter. This has always been a hard sell to the public, even if scientists get it, and the politicians of all the districts where supercomputing labs exist certainly don't want HPC centers to be Borged into the clouds. But they will get to brag about clouds and foundries, so they will adapt.
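
On Maxim Two, one concrete form that "HPC learning to run on AI iron" already takes is mixed-precision iterative refinement: do the expensive solve in the low-precision arithmetic that AI accelerators are built to deliver, then recover double-precision accuracy with cheap high-precision residual corrections – the broad idea behind the HPL-AI/HPL-MxP benchmark. The NumPy sketch below is our toy illustration of the principle, not anyone's production code; a real implementation would factor the matrix once and reuse the factors.

```python
# Toy sketch of mixed-precision iterative refinement: solve in float32 (a
# stand-in for the low-precision units on AI accelerators), then refine with
# float64 residuals until the answer reaches double-precision quality.
# Assumes only NumPy; the matrix size and tolerance are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
b = rng.standard_normal(n)

# Low-precision first guess.
x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)

for iteration in range(10):
    r = b - A @ x                                  # residual in float64
    if np.linalg.norm(r) / np.linalg.norm(b) < 1e-12:
        break
    # Correction computed in low precision, accumulated in high precision.
    dx = np.linalg.solve(A.astype(np.float32), r.astype(np.float32))
    x += dx.astype(np.float64)

print(iteration, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

And on Maxim Five, the CapEx-to-OpEx arithmetic is simple enough to put in a few lines. Every number below is hypothetical – plug in your own node price, lifetime, utilization, and cloud rate – but it makes the sticker-price ratio the maxim talks about concrete, and the real argument is that the cloud bill only accrues when the capacity is actually used.

```python
# Back-of-the-envelope sketch of the CapEx-to-OpEx trade. All numbers are
# hypothetical placeholders, chosen only to show the arithmetic.
CAPEX_PER_NODE = 250_000.0       # assumed purchase price of one HPC node, USD
LIFETIME_YEARS = 5
OPEX_PER_NODE_YEAR = 25_000.0    # assumed power, cooling, staff, facility share
UTILIZATION = 0.80               # fraction of hours the node does useful work
CLOUD_RATE_PER_HOUR = 75.0       # assumed on-demand price of a comparable instance

HOURS_PER_YEAR = 24 * 365
onprem_per_useful_hour = (CAPEX_PER_NODE / LIFETIME_YEARS + OPEX_PER_NODE_YEAR) / (
    HOURS_PER_YEAR * UTILIZATION
)

print(f"on-prem: ${onprem_per_useful_hour:.2f} per useful node-hour (paid up front)")
print(f"cloud:   ${CLOUD_RATE_PER_HOUR:.2f} per node-hour (paid only when used)")
print(f"ratio:   {CLOUD_RATE_PER_HOUR / onprem_per_useful_hour:.1f}x")
```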

“Investing in the future is never easy, but it is critical if we are to continue to develop and deploy new generations of high-performance computing systems, ones that leverage economic shifts, commercial practice, and emerging technologies. Let us be clear. The price of innovation keeps rising, the talent is following the money, and many of the traditional players – companies and countries – are struggling to keep up.”

Welcome to the New HPC.

Author's Note: We read a fair number of technical papers here at The Next Platform, and one of the games we like to play when we come across an interesting one is to guess what it will conclude before we even read it.

This is an old – and perhaps bad – habit learned from a physics professor many decades ago, who admonished us to figure out the nature of the problem and estimate an answer, in our heads and out loud before the class, before we actually wrote down the first line to solve the problem. This was a form of error detection and correction, which is why we were taught to do it. And it kept you on your toes, too, because the class was at 8 am and you didn't know you were going to have to solve a problem until the professor threw a piece of chalk to you. (Not at you, but to you.)

So when we came across the paper outlined above, we immediately went into speculative execution mode and this popped out:

“Just like we can’t really have a publicly funded mail service that works right or a publicly funded space program that works right, in the long run we will not be able to justify the cost and hassle of bespoke HPC systems. The hyperscalers and cloud builders can now support HPC simulation and modeling and have the tools, thanks to the metaverse, to not only calculate a digital twin of the physical world – and alternate universes with different laws of physics if we want to go down those myriad roads – but to allow us to immerse ourselves in it to explore it. In short, HPC centers are going to be priced out of their own market, and that is because of the fundamental economics of the contrast between hyperscaler and HPC center. The HPC centers of the world drive the highest performance possible for specific applications at the highest practical budget, whereas the hyperscalers always drive performance and thermals for a wider set of applications at the lowest cost possible. The good news is that in the future HPC sector, scientists will be focusing on driving collections of algorithms and libraries, not on trying to architect iron and fund it.”

We were pretty close.
