Ten Years of AWS and a Status Check for HPC Clouds
March 15, 2016 Nicole Hemsoth
Six years ago at the International Supercomputing Conference over some fine German beer, I sat and talked with a group of leading high performance computing folks from national labs and research centers in the United States and Europe about what “this whole cloud thing” could mean for the future of supercomputing. With few exceptions, the resounding answer was that it was mostly meaningless—a lot of hype for nothing, at least as far as supercomputing was concerned. And after all, they had their scientific computing grids, which were a lot like cloud in terms of collaboration capabilities, but without all the specialization.
These folks had a point, especially then. For workloads based on the MPI model, fast node-to-node communication was critical. That lack of capability in 2010 alone was a deal-killer (or at least an HPC cloud conversation killer). For those few who had the most to gain from the opportunities posed by high performance computing in the cloud, the design and engineering folks who run HPC simulations to power daily business, the performance hit was one thing; another was that licenses for expensive proprietary codes weren’t being made available for cloud use. For any of these users, too, the entire middleware layer (the on-ramping of complex applications with diverse workload requirements) was too much to consider on top of the other limitations.
What is notable here is that over those years, Amazon Web Services listened, even to a relatively small community of end users: the HPC crowd. They added GPU instances, which was certainly appealing to some running HPC simulations in the cloud, as well as 10 gigabit Ethernet for latency-sensitive workloads, not to mention a number of compute-intensive instances with varying degrees of oomph and memory-heavy types as well. While it’s true these serve other workloads outside of HPC, the company has done a noteworthy job of making the HPC cloud connection over the years. As someone whose job it has been to watch this closely, I find it hard, a decade in, not to be impressed by the effort. But to what extent have those efforts to create a more HPC-friendly cloud environment actually paid off?
Amazon Web Services turned ten today, and while the technology and business story has been a compelling one to watch over the years, for one segment of its user base, the real value of cloud for HPC is still an evolving story. Here’s the interesting thing though. The adoption curve for HPC applications on the public cloud is far less about what Amazon (and its “competitors” in the space) does to create the right environment–and far more about what the existing HPC ecosystem does for the cloud. It’s not about tooling, it’s about culture.
If AWS and its public cloud compatriots fail to attract big HPC users, it’s probably not because the critical hardware is missing, or even the right middleware. It’s because the culture is missing. That means the way that research HPC centers look at their workloads and where they run, all the way down to how ISVs decide how to bandy about their licensing. Because after all, if you have highly specialized HPC code you’re peddling based on per-node pricing, you know damned well you don’t “have” to make it available as a cloud offering if it doesn’t fit your business model. Sure, your users can work with open source engineering, design, and other codes, but they don’t do what yours does—and the people who work at the companies that can afford these expensive licenses have invested a great deal in the people and software partnership already.
And speaking of culture and investment, HPC sites, in commercial sectors or research, invest in hardware. Period. They have done that for many years, thus a datacenter investment is not considered a one-time thing. It’s ongoing. There is a culture of people, of datacenter ownership, of stewardship of the machines. Even when faced with a cheaper option of running everything in the public cloud, for several businesses this isn’t even a question. Even if the economics worked out, it would still be a monumental decision.
So where is the heralded public cloud for HPC? Where are those long-promised full-scale production workloads humming away on Amazon infrastructure? Well, they’re there, but in many cases, with those “true HPC” MPI, latency-sensitive jobs, it’s only sometimes, and only when centers need to avoid buying more iron.
Real HPC cloud is found in the bursting model. It’s hybrid. And in that use case, at scale anyway, it is the perfect story. HPC hardware is not cheap to buy, lease, maintain, staff, and license software for. Assuming the licenses come along for the hybrid cloud ride (an increasing number do), a site that sees its peak needs require an extra 100 cores twice per month, for instance, can simply burst that workload automatically, and doing so makes good business sense. And ten years in, AWS offers enough options that the cloud environment will not look that much different from the one in-house.
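To make the bursting arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. Only the "extra 100 cores twice per month" figure comes from the example above; the hours per burst, the cloud price per core-hour, and the yearly cost of an owned core are invented placeholders, not real AWS prices or real datacenter costs. The sketch also ignores data movement and license fees, both of which can change the answer.

```python
from dataclasses import dataclass


@dataclass
class BurstScenario:
    extra_cores: int                  # peak shortfall (100 in the example above)
    bursts_per_month: int             # how often the peak occurs (2 in the example above)
    hours_per_burst: float            # ASSUMPTION: length of each burst
    cloud_price_per_core_hour: float  # ASSUMPTION: placeholder, not a real AWS price
    owned_cost_per_core_year: float   # ASSUMPTION: amortized purchase, power, staff


def annual_burst_cost(s: BurstScenario) -> float:
    """Yearly cost of renting the extra cores only while the peak lasts."""
    core_hours = s.extra_cores * s.bursts_per_month * 12 * s.hours_per_burst
    return core_hours * s.cloud_price_per_core_hour


def annual_owned_cost(s: BurstScenario) -> float:
    """Yearly cost of over-provisioning: owning the extra cores all year."""
    return s.extra_cores * s.owned_cost_per_core_year


def breakeven_hours_per_burst(s: BurstScenario) -> float:
    """Burst length at which renting costs the same as owning."""
    cost_per_burst_hour = (
        s.extra_cores * s.bursts_per_month * 12 * s.cloud_price_per_core_hour
    )
    return annual_owned_cost(s) / cost_per_burst_hour


if __name__ == "__main__":
    # Every value below except 100 cores and 2 bursts is hypothetical.
    scenario = BurstScenario(
        extra_cores=100,
        bursts_per_month=2,
        hours_per_burst=24.0,
        cloud_price_per_core_hour=0.05,
        owned_cost_per_core_year=200.0,
    )
    print(f"burst:     ${annual_burst_cost(scenario):,.0f} per year")
    print(f"owned:     ${annual_owned_cost(scenario):,.0f} per year")
    print(f"breakeven: {breakeven_hours_per_burst(scenario):.0f} hours per burst")
```

With these made-up inputs, bursting wins by a wide margin, but the breakeven function is the more useful part: plug in your own numbers and it tells you how long each peak would have to last before owning the extra iron becomes the cheaper choice.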
Because here’s another benefit for HPC users on the AWS cloud. Chances are quite good, given long upgrade cycles (4-6 years, depending on the shop), that AWS has better hardware than you do. Yes, there’s a latency hit. No, your commercial codes might not have licenses that transfer. Yes, data movement is an expensive thing for large simulation workloads. But. But…to test, develop, and run in burst mode on the latest greatest hardware on demand? How is this not the most compelling story to hit HPC in theory, if not practice, in years?
The data does not favor the view that cloud will burst into HPC budgets, however. For instance, Intersect360 Research, which publishes in-depth reports based on many high performance computing sites around the world, shows an interesting flatline for cloud spending.
Of course, if bursting is the primary use case for many users with large HPC installations, it is fair to assume the budget room would be small, since it’s occasional use and likely very inexpensive. While it will never be cost-effective to simply rent a full HPC environment equal to one hundred or more nodes for full production use in the cloud exclusively, that 3% budget room Intersect360 sees is enough to save an HPC site many thousands of dollars that would have gone into server over-provisioning for peak needs.
There will never be a replacement for large-scale supercomputers. While the mode of computation may change over the coming decades, outsourcing complex simulations to a cloud resource will not make economic sense, among other mismatches. For smaller-scale high performance computing, however, cloud continues to be an attractive option, sometimes for full-bore handling of HPC workloads but more often than not for “bursty” workloads at centers where over-provisioning HPC systems to handle peak demand was the norm for years.
But happy decade to AWS, which has worked hard to capture and understand the needs of a small but important computing community. Enlightened self-interest or not, it has given HPC hardware vendors a run for their money—and has forced ISVs into thinking about how their licenses work in an era of “ubiquitous computing”.
As for the use cases for HPC cloud, they mount with each passing year, many focused on life sciences and manufacturing. As of now, the most compelling place to go to understand all of these different adoption models is the UberCloud project, run by Wolfgang Gentzsch, who, incidentally, was the only one over those beers in 2010 who wasn’t saying the cloud was only fluff for HPC.