Will HPC Be Eaten By Hyperscalers And Clouds?

Some of the most important luminaries in the HPC sector have spoken from on high, and their conclusions about the future of the HPC market are probably going to shock a lot of people.

In a paper called Reinventing High Performance Computing: Challenges and Opportunities, written by Jack Dongarra of the University of Tennessee and Oak Ridge National Laboratory – Jack, didn’t you retire? – along with Dan Reed of the University of Utah and Dennis Gannon, formerly of Indiana University and Microsoft, we get a fascinating historical view of HPC systems and then some straight talk about how the HPC industry needs to collaborate more tightly with the hyperscalers and cloud builders for a lot of technical and economic reasons.

Many in the HPC market have no doubt been thinking along the same lines. It is in the zeitgeist, being transported on the cosmic Ethernet. (And sometimes InfiniBand, where low latency matters.) And by the way, we are not happy about any of this, as we imagine you are not either. We like the diversity of architectures, techniques, and technologies that the HPC market has developed over the years. But we also have to admit that the technology trickle-down effect, where advanced designs eventually make their way into large enterprises and then everywhere else, did not happen at the speed or to the extent that we had hoped over the decades we have been watching this portion of the IT space.

As usual, the details of the scenario painted in this paper and the conclusions that the authors draw are many and insightful, and we agree wholeheartedly that there are tectonic forces at play in the upper echelons of computing. Frankly, we founded The Next Platform with this idea in mind and used the same language, and in recent years have also wondered how long the market for systems that are tuned specifically for HPC simulation and modeling would hold out against the scale of compute and investment by the hyperscalers and cloud builders of the world.

The paper’s authors have a much better metaphor for contrasting large-scale HPC system development, and that is to look at it like a chemical reaction. HPC investments, especially for capability-class machines, are endothermic, meaning they require infusions of capital from governments and academia to cover the engineering costs of designing and producing advanced systems. But investments in large-scale machinery at the hyperscalers and cloud builders are exothermic, meaning they generate cash – among the Magnificent Seven of Amazon, Microsoft, Google, Facebook, Alibaba, Baidu, and Tencent, it is enormous amounts of money. We would go so far as to say that the reaction is volcanic among the hyperscalers and cloud builders, which is exothermic with extreme attitude. Enough to melt rock and build mountains.

The geography of the IT sector has been utterly transformed by these seven continents of compute, and we all know it, and importantly, so does the HPC community that is trying to get to exascale and contemplating 10 exascale and even zettascale.

“Economies of scale first fueled commodity HPC clusters and attracted the interest of vendors as large-scale demonstrations of leading edge technology,” the authors write in the paper. “Today, the even larger economies of scale of cloud computing vendors has diminished the influence of high-performance computing on future chip and system designs. No longer do chip vendors look to HPC deployments of large clusters as flagship technology demonstrations that will drive larger market uptake.”

The list of truisms that Dongarra, Reed, and Gannon outline as they survey the landscape is unequivocal, and we quote:

  • Advanced computing of all kinds, including high-performance computing, requires ongoing non-recurring engineering (NRE) investment to develop new technologies and systems.
  • The smartphone and cloud services companies are cash rich (i.e., exothermic), and they are designing, building, and deploying their own hardware and software infrastructure at unprecedented scale.
  • The software and services developed in the cloud world are rich, diverse, and rapidly expanding, though only some of them are used by the traditional high-performance computing community.
  • The traditional computing vendors are now relatively small economic players in the computing ecosystem, and many are dependent on government investment (i.e., endothermic) for the NRE needed to advance the bleeding edge of advanced computing technologies.
  • AI is fueling a revolution in how businesses and researchers think about problems and their computational solution.
  • Dennard scaling has ended and continued performance advances increasingly depend on functional specialization via custom ASICs and chiplet-integrated packages.
  • Moore’s Law is at or near an end, and transistor costs are likely to increase as feature sizes continue to decrease.
  • Nimble hardware startups are exploring new ideas, driven by the AI frenzy.
  • Talent is following the money and the opportunities, which are increasingly in a small number of very large companies or creative startups.

There is no question that the HPC and hyperscaler/cloud camps have been somewhat allergic to each other over the past decade or two, although there has been some cross-pollination in recent years, with both people and technologies from the HPC sector being employed by the hyperscalers and cloud builders – mostly to attract HPC simulation and modeling workloads, but also because of the inherent benefits of technologies such as MPI and InfiniBand when it comes to driving the key machine learning workloads that have made the hyperscalers and cloud builders standard bearers for the New HPC. They didn’t invent the ideas behind machine learning – the phone company did – but they did have the big data and the massive compute scale to perfect it, and they are also going to be the ones building the metaverse – or metaverses – which are really just humongous simulations, driven by the basic principles of physics and done in real time.
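
To make that overlap concrete, here is a minimal sketch (our illustration, not anything from the paper) of the allreduce collective that sits at the heart of both MPI-based HPC codes and data-parallel machine learning training; the "gradient" values and sizes are placeholders.

```python
# Hypothetical sketch: the same MPI collective used for HPC domain decomposition
# also underpins data-parallel ML training. Requires numpy and mpi4py; run with
# something like `mpirun -np 4 python allreduce_demo.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank computes a local gradient on its shard of the data (stand-in values here).
local_grad = np.full(8, float(rank), dtype=np.float64)

# Sum across all ranks in place, then average -- at least one allreduce per training
# step, which is why low-latency interconnects such as InfiniBand matter so much.
comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
local_grad /= size

if rank == 0:
    print("averaged gradient:", local_grad)
```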

What it comes down to is that standalone HPC in the national and academic labs takes money and has to constantly justify the architectures and funding for the machines that run their codes, and the traditional HPC vendors – so many of them are gone now – could not generate enough revenue, much less profit, to stay in the game. HPC vendors were more of a public-private partnership than Wall Street ever wanted to think about or the vendors ever wanted to admit. And when they did turn a profit, it was never sustainable – just as being a server OEM is getting close to unsustainable due to the enormous buying power of the hyperscalers and cloud builders.

We will bet a Cray-1 supercomputer assembled from parts acquired on eBay that the hyperscalers and cloud builders will figure out how to make money on HPC, and they will do it by offering applications as a service, not just infrastructure. National and academic labs will partner there and get their share of the cloud budget pool, and in cases where data sovereignty and security concerns are particularly acute, the clouds will offer HPC outposts or whole dedicated datacenters, shared among the labs and securely away from other enterprise workloads. Moreover, the cloud builders will snap up the successful AI hardware vendors – or design their own AI accelerator chips, as AWS and Google do – and the HPC community will learn to port their routines to these devices as well as to CPUs, GPUs, and FPGAs. In the longest of runs, people will recode HPC algorithms to run on a Google TPU or an AWS Trainium. No, this will not be easy. But HPC will have to ride the coattails of AI, because otherwise it will diverge from the mainstream hardware path and cease to be an affordable endeavor.
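
What might "recoding an HPC algorithm" for an AI accelerator look like? Here is a hedged, minimal sketch of our own (not from the paper): a Jacobi relaxation step written in JAX, which XLA compiles for whatever backend is available, be it a CPU, a GPU, or a cloud TPU, without changing the numerical kernel.

```python
# Hypothetical sketch: a classic HPC kernel (Jacobi relaxation on a 2D grid)
# expressed in JAX so the same code can target CPU, GPU, or TPU backends.
# Requires the jax package; array sizes and iteration counts are arbitrary.
import jax
import jax.numpy as jnp

@jax.jit
def jacobi_step(u):
    # Replace each interior point with the average of its four neighbors;
    # boundary values stay fixed, acting as Dirichlet conditions.
    interior = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    return u.at[1:-1, 1:-1].set(interior)

u = jnp.zeros((256, 256)).at[0, :].set(1.0)   # hot top edge, cold everywhere else
for _ in range(500):
    u = jacobi_step(u)
print("center value after 500 sweeps:", float(u[128, 128]))
```

The point is not that this is how any given lab will do it, but that the porting burden shifts from hand-tuning for bespoke iron to expressing kernels in frameworks the accelerator vendors already maintain.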

As they prognosticate about the future of HPC, Dongarra, Reed, and Gannon outline the following six maxims that should be used to guide its evolution:

Maxim One: Semiconductor constraints dictate new approaches. There are constraints from Moore’s Law slowing and Dennard scaling stopping, but it is more than that. We have foundry capacity issues and geopolitical problems arising from chip manufacturing, as well as the high cost of building chip factories, and there will need to be standards for interconnecting chiplets to allow easy integration of diverse components.

Maxim Two: End-to-end hardware/software co-design is essential. This is a given for HPC, and chiplet interconnect standards will help here. But we would counter that the hyperscalers and cloud builders limit the diversity of their server designs to drive up volumes. So just as AI learned to run on HPC iron back in the late 2000s, HPC will have to learn to run on the AI iron of the 2020s. And that AI iron will be located in the clouds.
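
One hedged example of what "HPC learning to run on AI iron" can mean is mixed-precision iterative refinement: do the expensive solve in the low precision that AI accelerators are built for, then recover full accuracy with high-precision residual corrections. The sketch below is illustrative only, uses float32 as a stand-in for tensor-core formats, and re-solves rather than reusing a factorization for brevity.

```python
# Hypothetical sketch of mixed-precision iterative refinement with numpy:
# a cheap float32 solve (standing in for low-precision accelerator math)
# plus float64 residual corrections. A real code would reuse the LU factors.
import numpy as np

rng = np.random.default_rng(0)
n = 512
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant test matrix
b = rng.standard_normal(n)

A32 = A.astype(np.float32)
x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)  # low-precision solve

for _ in range(5):
    r = b - A @ x                                  # residual computed in float64
    dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
    x += dx                                        # correction restores accuracy

print("final residual norm:", np.linalg.norm(b - A @ x))
```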

Maxim Three: Prototyping at scale is required to test new ideas. We are not as optimistic as Dongarra, Reed, and Gannon that HPC-specific systems will be created – much less prototyped at scale – unless one of the clouds corners the market on specific HPC applications. Hyperscalers bend their software to fit cheaper iron, and they only create unique iron with homegrown compute engines when they feel they have no choice. They will adopt mainstream HPC/AI technologies every time, and HPC researchers are going to have to make do. In fact, that will be largely what the HPC jobs of the future will be: Making legacy codes run on new clouds.

Maxim Four: The space of leading edge HPC applications is far broader now than in the past. And, as they point out, it is broader because of the injection of AI software and sometimes hardware technologies.

Maxim Five: Cloud economics have changed the supply chain ecosystem. Agreed, wholeheartedly, and this changes everything. Even if cloud capacity costs 5X to 10X as much as running it on premises, the cloud builders have so much capacity that every national and academic lab could be pulling in the same direction as they modernize codes for cloud infrastructure – and where that infrastructure is sitting doesn’t matter. What matters is changing from CapEx to OpEx.
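
To put rough numbers on that CapEx-to-OpEx shift, here is a back-of-the-envelope sketch with entirely made-up figures (ours, not the authors’): at a 5X per-hour premium, rented capacity only beats owned capacity on raw cost when the owned machine is busy less than about 20 percent of the time, so the case for the cloud rests on flexibility and scale rather than on price.

```python
# Hypothetical arithmetic: compare cost per *useful* node-hour for an owned system
# at various utilization levels against rented capacity at a 5X hourly premium.
# All numbers are invented for illustration.
CAPEX_PER_NODE = 50_000.0            # assumed five-year cost: hardware, power, staff
LIFETIME_HOURS = 5 * 365 * 24
CLOUD_PREMIUM = 5.0                  # low end of the 5X to 10X range cited above

onprem_rate = CAPEX_PER_NODE / LIFETIME_HOURS    # paid whether the node is busy or idle
cloud_rate = CLOUD_PREMIUM * onprem_rate         # paid only for hours actually used

for utilization in (0.9, 0.5, 0.25, 0.1):
    onprem_per_useful_hour = onprem_rate / utilization
    winner = "cloud" if cloud_rate < onprem_per_useful_hour else "on-prem"
    print(f"utilization {utilization:4.0%}: on-prem ${onprem_per_useful_hour:5.2f} "
          f"vs cloud ${cloud_rate:5.2f} per useful node-hour -> {winner}")
```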

Maxim Six: The societal implications of technical issues really matter. This has always been a hard sell to the public, even if scientists get it, and the politicians of all the districts where supercomputing labs exist certainly don’t want HPC centers to be Borged into the clouds. But they will get to brag about clouds and foundries, so they will adapt.

“Investing in the future is never easy, but it is critical if we are to continue to develop and deploy new generations of high-performance computing systems, ones that leverage economic shifts, commercial practice, and emerging technologies. Let us be clear. The price of innovation keeps rising, the talent is following the money, and many of the traditional players – companies and countries – are struggling to keep up.”

Welcome to the New HPC.

Author’s Note: We read a fair number of technical papers here at The Next Platform, and one of the games we like to play when we come across an interesting paper is to guess what it will conclude before we even read it.

This is an old – and perhaps bad – habit learned from a physics professor many decades ago, who admonished us to figure out the nature of the problem and estimate an answer, in our heads and out loud before the class, before we actually wrote down the first line of the solution. It was a form of error detection and correction, which is why we were taught to do it. And it kept you on your toes, too, because the class was at 8 am and you didn’t know you were going to have to solve a problem until the professor threw a piece of chalk to you. (Not at you, but to you.)

So when we came across the paper outlined above, we immediately went into speculative execution mode and this popped out:

“Just like we can’t really have a publicly funded mail service that works right or a publicly funded space program that works right, in the long run we will not be able to justify the cost and hassle of bespoke HPC systems. The hyperscalers and cloud builders can now support HPC simulation and modeling and have the tools, thanks to the metaverse, to not only calculate a digital twin of the physical world – and alternate universes with different laws of physics if we want to go down those myriad roads – but to allow us to immerse ourselves in it to explore it. In short, HPC centers are going to be priced out of their own market, and that is because of the fundamental economics of the contrast between hyperscaler and HPC center. The HPC centers of the world drive the highest performance possible for specific applications at the highest practical budget, whereas the hyperscalers always drive performance and thermals for a wider set of applications at the lowest cost possible. The good news is that in the future HPC sector, scientists will be focusing on driving collections of algorithms and libraries, not on trying to architect iron and fund it.”

We were pretty close.

Comments

  1. I believe that this article is a wishlist from someone at AWS or Google, or someone without the budget to acquire an HPC system. The companies or institutions that need answers to complex problems are not going to wait for all the jargon and wishful thinking mentioned here; they need answers now – fluid dynamics models now, not in years to come. So I call this article’s bluff. HPC is a $40+ billion a year industry that has been growing year over year – I see this day to day – and it is only growing!

    • Seems to me the HPC people can have that now or in the very near future and, as the article points out, at much better economics – not years and years away. That’s mainly because of what is offered today and the quick rate of further development taking place in AI. Did you see how NVDA opted not to focus on the FP64 market with its latest GPUs? I imagine that is because the AI they’ve already managed to achieve together with partners provides the needed answers, just in a different way. Someone likened the AI approach to doing the math in one’s head, versus the brute-strength approach with an emphasis on FP64 precision being longhand math on paper. While the answers will be the same, one is much more expensive in terms of compute power, and unnecessary. I find that an apt analogy.

    • I think GreenLake is a kind of outpost, and it has traction and a chance. But that only addresses the infrastructure layer. The addition of Cray certainly gives HPE more longevity because of all of those skills in HPC, particularly with workload expertise. But consider that many of the key architects of Cray now work at Microsoft Azure.

  2. The HPC market is significantly larger than the niche systems located at national labs, the occasional university, and within three-letter agencies. The vast majority of HPC workloads and applications don’t require esoteric hardware or the highest bandwidth, lowest latency interconnects. Most operate over vanilla Ethernet, though certainly the Cray-HPE Slingshot Ethernet delivers both, as well as predictable performance under very high loads, which most Ethernet – and, prior to recent improvements, Nvidia’s InfiniBand implementation – does not.

    This is not to say that there isn’t room for custom hardware components (integrated, chiplet, or wafer-based), but that hardware needs to seamlessly provide its value without requiring application and workload modification. Let’s face it, HPC’s Achilles’ heel has always been its legacy (some would say antiquated or perhaps decrepit) software stacks, which take anywhere from 9 to 18 months to port before much, but never all, of the full potential of any new hardware or system can be realized. When you combine system delivery delays (how often has any custom HPC system been delivered on time?), the huge porting costs and time delays, and the rapid performance gains of the underlying hardware (distributed HPC systems don’t use or cannot tolerate rolling hardware upgrades, which means that for much of their very limited productive years they run on hardware lagging by two or more generations), one has to question their economic viability. Add into the mix politics driven by many who have little to no understanding of the technology, science, or potential benefits to humanity, and the current custom HPC operating model becomes extremely questionable.

    In contrast, cloud providers have mastered supply chains leading to massive economies of scale, scale-out system / solution management with fully integrated security and resiliency, and the ability to rapidly and seamlessly integrate new hardware and capabilities (lagging hardware is quickly redeployed downstream to less demanding applications ensuring that demanding applications operate on the best at hand). If they see a viable market or an opportunity through public-private funding, they are more than willing to invest to deliver what their customers need. Some cloud providers are even certified to provide on-premises gear within high security / sensitive environments including various three-letter agencies.

    Cloud providers are far from perfect, but they can support nearly all but perhaps the most extreme niche HPC applications, and their capabilities and benefits, from multiple economic and technology perspectives, far outweigh their shortcomings. Industry standards are critical to defining the mechanical and communication edges, but they need to be carefully designed to enable, not hinder, innovation. Far, far too often the hardware and semiconductor companies driving industry standards try to lock down, and all too often artificially delay, specifications and innovations to meet their own business requirements. Such efforts have slowed and constrained innovation, and they have led to overly complex standards, which unfortunately produce many components that are not fully interoperable or compliant. Fortunately, some of this is starting to change; for example, the DMTF specifies a wide range of flexible data models that abstract the underlying hardware implementation, which can dramatically simplify software-hardware integration while accelerating innovation and new service and capability delivery.

  3. Yes, you nailed it and made me chuckle. To put it simply, the PC ate all the home computers in the 1980s, then ate all the minicomputers, then ate all the Cray computers. My PC has the power of a supercomputer if I go back enough years, yet it’s basically the same PC and PC concept from the 1980s. Yes, we may call it a server, but it’s just a PC adapted to the rack rather than the desktop. The fancy exotic supercomputers simply can’t keep up technically or commercially. However, I do think there is a lot of scope for building a massive processor out of FPGAs. These are somewhat faster than CPUs for many tasks because they are programmable hardware that runs the algorithm in logic gates.

    Anyway, it is a shame to see HPC get swallowed up by a cloud of rack-mounted PCs, but I suspect there will always be a lab somewhere building something so different that they even have to make their own parts.

  4. There are some roadblocks standing in the way of this thesis. One is an old truism: “nothing grows to the sky.” The other is that there is no free lunch. Every IT organization, cloud or on-prem, has to have slack capacity in order to handle workload spikes (both anticipated and unanticipated). The larger your infrastructure, the more slack capacity (in real terms, not proportionate) you have to have to handle the spikes. This is very expensive and will have to be priced into the cloud fee schedule.

    In our research, we’ve heard several things that argue against the cloud being a suitable host for HPC today. Here are a few of the biggest: 1) You can’t get enough instances in the same region for a large HPC workload…2) You can’t significantly reduce your IT staffing just because you go to the cloud – you still need app specialists, storage specialists, tuning specialists, and so on. And 3) the costs of public cloud are 3X to 7X the fully burdened costs of on-prem.

    It’s interesting to consider recent moves by Amazon and Google to extend their system life by an additional year. They added 25% to the useful life of their systems. There’s a reason for this. Could it be that their reading of the tea leaves doesn’t have them taking over the IT world? I don’t know, but it’s something to consider.
