
Sky-High Hurdles, Clouded Judgements for IaaS at Exascale

Back in 2009, I was the editor of a small side publication from the supercomputing magazine HPCwire called HPC in the Cloud. At the time, the concept was new enough to warrant its own dedicated coverage and a formal division between what HPC was and what it would eventually become.

The first ISC after the publication launched was an eye-opener because, quite frankly, people thought cloud HPC was a passing fad. When I told people what we were covering, I got the kind of eyeroll quantum computing drew circa 2000, particularly from the greybeards (you know who you are). After all, in bandwidth-hungry, latency-sensitive HPC, the cloud could never have a place. Grid computing was fine, thank you.

But as the years rolled on, it always struck me as kind of funny that, well over a decade later, the questions around HPC in the cloud haven't changed much: Will latency make it a no-go? How big is the performance hit? Won't it cost a fortune? How do you even compare costs to on-prem with so many facility costs to consider? The list went on then, and it goes on, much unchanged, today.

Enter Oak Ridge National Lab in 2023, asking these questions (again? still?) but with exascale in mind. Even though the questions follow the same old theme, the findings do not. The lab has produced a detailed, and blessedly specific, set of conclusions about all the classic questions, with enough technical context to make it a worthwhile read.

They stack up the big three, AWS, GCP, and Azure, across five "leadership-class" sample workloads. For a real-world kick, those are run under time and budget limitations. The comparison is nuanced: it is not looking for which cloud is best, but at how different configurations worked and what the consistent issues were in scalability, which was the prime target (no I/O or other performance metrics).

Still, we all like a contest, so roughly, based on their recap: each of the cloud providers (AWS, Azure, and GCP) had unique strengths and limitations, but there wasn't a clear "best" performer across all metrics (this is my own reading between the lines of the report, so don't blame ORNL; I'm your culprit).

But nothing is perfect, and they all had specific challenges tied to resource allocation, performance debugging, and budget controls. In our view, the resource allocation problem was (not unexpectedly) nasty.

We all know the magical story about the cloud's inherent elasticity in a land where resources get allocated dynamically based on job schedules. However, at scale that churn led to big inefficiencies, with incremental allocations becoming a particular bugger. In scenarios where job demands exceeded the available nodes, the system would often allocate only a subset, and those nodes sat idle, racking up charges, while the job waited for the rest. To put this in perspective, the evaluation noted expenses nearing $50,000 on idle nodes.
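To make the mechanics concrete, here is a minimal sketch of how a partial grant bleeds money before a tightly coupled job can even launch. The prices, node counts, and grant timeline below are hypothetical, not figures from the report.

```python
# Minimal sketch (hypothetical numbers): why incremental allocation burns money.
# A tightly coupled job cannot start until all requested nodes arrive, so any
# nodes granted early just sit there accruing charges.

NODE_HOURLY_RATE = 40.0   # assumed cloud GPU node price, $/node-hour
NODES_REQUESTED  = 256    # the job needs all of these before it can launch

def idle_cost(grant_schedule):
    """grant_schedule: list of (hour, nodes_granted_so_far) checkpoints."""
    cost = 0.0
    for (t0, n0), (t1, _) in zip(grant_schedule, grant_schedule[1:]):
        # nodes already granted but waiting for the full allocation = pure waste
        cost += n0 * (t1 - t0) * NODE_HOURLY_RATE
    return cost

# e.g. 64 nodes arrive immediately, 128 after 2 hours, the full 256 after 6 hours
schedule = [(0, 64), (2, 128), (6, NODES_REQUESTED)]
print(f"idle spend before launch: ${idle_cost(schedule):,.0f}")  # ~$25,600 here
```

Stretch that waiting game over a few large jobs and a figure like the $50,000 ORNL cites stops looking surprising.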

This stark number was primarily due to inefficiencies stemming from the semantic gap, which is something the ORNL team discussed in detail. It basically refers to the disconnect between the in-house HPC environment and the underlying cloud infrastructure.

It's a bit of a nebulous concept, but here's an example: this abstraction layer between the HPC environment and the core cloud infrastructure was particularly visible in resource availability and cost management. AWS, for instance, imposed a limit of 32 A100 GPU nodes. Azure, despite its promise, grappled with scaling beyond 128 nodes. Meanwhile, GCP, originally targeting A100 GPU nodes, settled for V100s due to availability constraints.

The team shares that the absence of granular budget controls further amplified these concerns. Typically, the existing cloud paradigm facilitates broader, cluster-level oversight. With that approach, however, a single user could potentially deplete an entire project's budget, a considerable risk in large-scale projects.
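ORNL does not prescribe a fix, but the control they describe as missing is easy to picture. Below is a hypothetical sketch of the per-user budget guard a job submission wrapper could enforce; none of the names or numbers come from the report or from any real cloud API.

```python
# Hypothetical sketch: the kind of per-user budget guard the report implies is
# missing from cluster-level cloud billing. Just the control logic, no real API.

from dataclasses import dataclass, field

@dataclass
class ProjectBudget:
    total: float                               # whole project's allocation, $
    per_user_cap: float                        # most any single user may spend, $
    spent: dict = field(default_factory=dict)  # user -> dollars spent so far

    def authorize(self, user: str, estimated_cost: float) -> bool:
        user_spent = self.spent.get(user, 0.0)
        if sum(self.spent.values()) + estimated_cost > self.total:
            return False                       # would blow the project budget
        if user_spent + estimated_cost > self.per_user_cap:
            return False                       # would blow this user's share
        self.spent[user] = user_spent + estimated_cost
        return True

budget = ProjectBudget(total=500_000, per_user_cap=50_000)
ok = budget.authorize("alice", estimated_cost=256 * 40.0 * 12)  # 256 nodes, 12 hours
print("submit job" if ok else "reject: over per-user budget")
```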

These problems add up in other dollars-and-cents ways that are worth a look. The team found that an HPC node on any of the three clouds incurs an approximate cost of $40 per node-hour. In comparison, an on-prem exascale system like Frontier comes in at an estimated node-hour rate of merely $2.35. That's a mere 5.88 percent of the cloud's computational expense.

This stark difference—close to 17 times in terms of public pricing—casts a considerable shadow over the economic viability of cloud platforms for large-scale HPC endeavors.
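The arithmetic behind those two figures is worth a quick sanity check, using the node-hour rates as cited above:

```python
# Back-of-envelope check on the report's pricing gap (rates as cited above).
cloud_rate    = 40.00   # $/node-hour across the three clouds, per the evaluation
frontier_rate = 2.35    # $/node-hour estimate for on-prem Frontier

print(f"on-prem share of cloud cost: {frontier_rate / cloud_rate:.2%}")   # ~5.88%
print(f"cloud premium over on-prem: {cloud_rate / frontier_rate:.1f}x")   # ~17.0x
```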

And we finally lead back to the question of performance, which always seems solved at the hardware level among the cloud providers. However, just having the network connectivity in place is not enough. The team points to features that only come with on-prem HPC as particularly important. For instance, on-prem HPC shops have in-depth monitoring capabilities, and that granularity aids in swift bottleneck identification and resolution. But mind the gap (again): with the cloud's layered abstractions, such insights are not so easy to come by. They point to AWS's Elastic Fabric Adapter (EFA) as one example: just getting the needed functionality meant close collaboration between ORNL and AWS support to optimize GPU-aware MPI, which illustrates the challenges inherent in cloud-based HPC.
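For readers who have not bumped into it, GPU-aware (or CUDA-aware) MPI means handing device buffers straight to MPI calls so data moves from GPU memory onto the fabric without a detour through host memory. The sketch below shows the general shape of such a call using mpi4py and CuPy; it assumes an MPI build with CUDA support and is not ORNL's code or anything specific to EFA.

```python
# Hedged sketch of a GPU-aware MPI call, the capability ORNL and AWS had to tune.
# Assumes mpi4py compiled against a CUDA-aware MPI plus CuPy; run under mpirun.

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# each rank holds its buffers in GPU memory
send = cp.full(1024, rank, dtype=cp.float32)
recv = cp.empty_like(send)

# make sure GPU work on the buffer is finished before the fabric touches it
cp.cuda.get_current_stream().synchronize()

# with a GPU-aware MPI, the device pointers go straight to the interconnect,
# skipping a staging copy through host memory; without it this is slow or fails
comm.Allreduce(send, recv, op=MPI.SUM)

if rank == 0:
    print("allreduce result (first element):", float(recv[0]))
```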

On the network and node architecture front, while AWS, Azure, and GCP mirrored the capabilities of some top-tier supercomputers in terms of node architecture, their network configurations are quite a bit different. ORNL cites cloud network scalability in particular: all the right hardware pieces are there, but scaling caps out at around 1,000 endpoints.

In terms of bandwidth, AWS delivered 50 GB/s, GCP doubled that to 100 GB/s, and Azure doubled it again to 200 GB/s. Such disparities, especially when scaling into thousands of nodes, can greatly influence the efficiency and integration of HPC operations, and they are another far cry from what HPC shops can get on-prem.
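A rough back-of-envelope shows why that spread matters. Taking the per-node bandwidth figures as quoted, and a purely illustrative per-node transfer size, the time spent just moving data differs by 4x between the slowest and fastest configuration:

```python
# Rough sketch of why the per-node bandwidth spread matters at scale
# (bandwidth figures as quoted above; the transfer size is purely illustrative).

data_per_node_gb = 500   # hypothetical per-node traffic for one exchange, in GB
bandwidths_gbs = {"AWS": 50, "GCP": 100, "Azure": 200}   # GB/s per node, as reported

for cloud, bw in bandwidths_gbs.items():
    print(f"{cloud:5s}: {data_per_node_gb / bw:5.1f} s to move {data_per_node_gb} GB per node")
```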

So, same questions, but different day in the grand scheme of things.

This report is worth your time. Few of you are deciding between on-prem and cloud for exascale-class workloads, but it shows how and where things break at scale, and that's worth the read alone.
