Back in 2009, when yours truly was assigned the primary beat of covering supercomputing on remote hardware (then dubbed the mysterious “cloud”), cloud-based high performance computing seemed to be little more than a pipe dream.
At that time, most scientific and technical computing communities had already developed extensive grids to extend their research beyond physical borders, and the idea of introducing new levels of latency, software, and management interfaces did not appear to be anything most HPC centers were looking forward to—even with the promise of cost-savings (as easy “bursting” was still some time off).
Amazon Web Services pioneered cloud in those early days, helped by a strong footing: it already had the hardware and software infrastructure “sitting around” to sell off. It also made great strides to bring its capabilities to the scientific computing community, even without a sense of how such investments would pay off. It was not an easy sell for the high performance computing folks initially (for the reasons listed above, as well as privacy and security, among others), but use case by use case, it was slowly proven out that the public cloud could be a valuable scientific computing resource, even if just for occasional workloads.
AWS added a number of new capabilities over the years on the hardware side, culminating more recently with GPU instances for accelerated compute, 10 gigabit Ethernet to assuage the latency and bandwidth naysayers, and meatier CPU and memory instances as well, all designed for HPC. Although they generally like to announce new offerings for this community during Re:Invent or major HPC shows, today the cloud giant unveiled a fresh suite of SaaS tools for those with HPC workloads called Alces Flight, which extends the ease of executing Linux-based HPC applications on AWS iron.
Available via the AWS Marketplace, this is designed to be a quick-click HPC environment (assuming you’re not brand new to AWS; there is a bit of a curve otherwise) that can be spun up using either on-demand or spot instances. In addition to the scheduler and requisite middleware, there are hundreds of HPC applications that are ready to run. It is also possible, of course, to use the service to install your own applications on a Flight compute cluster. The service leverages Alces Gridware, which packages applications and libraries and puts them at user fingertips on physical clusters, and now on the AWS cloud. The packages include the latest versions of Perl and Python; wrappers for GCC and for commercial compilers from Intel, PGI, and others; and popular MPI environments, including OpenMPI and IntelMPI. Applications for nearly every scientific computing realm are included, with several in chemistry (NAMD, Gromacs, CHARMM) and biosciences (BLAST, Mosaik, SAMtools, and more), as well as major benchmark and other verification tools, including LINPACK, STREAM, and others. The full list of HPC applications that can be spun up using the service can be found here.
Chief Evangelist at AWS, Jeff Barr, said today that “after designing and managing hundreds of HPC workflows for national and academic supercomputing centers in the UK, Alces built and validated HPC workflows tailored to researchers and automated applications built for supercomputing centers.” While Alces might not be a familiar name, particularly to those in HPC in the U.S., this kind of service has been important when HPC users have spun up other large clusters with full integration. The best-known use cases of this in HPC have involved Cycle Computing, which handles the middleware, libraries, and applications for very large jobs on AWS (and now Google Compute Engine).
The Alces Flight service does take some familiarity with AWS, at least judging from the getting started guide. It relies on AWS CloudFormation at its core and leverages basic EC2 and EBS. It features the same auto-scaling capabilities that AWS users are familiar with and also lets users define the initial size and shape of the cluster.
AWS says the goal is to “rapidly deliver a whole HPC cluster, ready to go and complete with job scheduler and applications. Clusters are deployed in a VPC environment for security with SSH and graphical desktop connectivity for users. Data management tools for POSIX and S3 object storage are also included.”
For those looking to play, a small cluster of up to 8 nodes can be launched. For those who already subscribe to Alces services, the CloudFormation template can be downloaded from Alces and used to fire up the AWS resources required.
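To give a rough sense of what a CloudFormation-driven cluster launch involves, here is a minimal Python sketch that builds a template for a handful of compute nodes in a cluster placement group. It is not the Alces Flight template, which is not reproduced in this article. The function name `build_cluster_template`, the logical IDs, the `NodeImageId` parameter, and the default `c4.8xlarge` instance type are all assumptions made for illustration. The 8-node cap simply mirrors the small-cluster limit mentioned above.

```python
import json


def build_cluster_template(node_count, instance_type="c4.8xlarge"):
    """Return a minimal CloudFormation template (as a dict) for a small cluster.

    Hypothetical illustration only: the real Alces Flight template also sets up
    a VPC, a scheduler, storage, and more, none of which is modeled here.
    """
    if not 1 <= node_count <= 8:
        raise ValueError("this sketch caps the cluster at 8 nodes")

    # A cluster placement group asks EC2 to pack instances close together
    # for lower network latency between nodes.
    resources = {
        "ClusterPlacementGroup": {
            "Type": "AWS::EC2::PlacementGroup",
            "Properties": {"Strategy": "cluster"},
        }
    }

    # One EC2 instance resource per compute node, all in the same group.
    for i in range(node_count):
        resources[f"ComputeNode{i}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Ref": "NodeImageId"},
                "InstanceType": instance_type,
                "PlacementGroupName": {"Ref": "ClusterPlacementGroup"},
            },
        }

    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative small HPC cluster (not the Alces template)",
        "Parameters": {
            # The machine image is left as a parameter; no real AMI ID is assumed.
            "NodeImageId": {
                "Type": "AWS::EC2::Image::Id",
                "Description": "AMI to boot on every compute node",
            }
        },
        "Resources": resources,
    }


if __name__ == "__main__":
    print(json.dumps(build_cluster_template(4), indent=2))
```

The JSON this prints could be supplied to CloudFormation as a stack template, with the image ID filled in at launch. Generating the node resources in a loop keeps the sketch short; a production template would more likely use an auto-scaling group so that the cluster can grow and shrink.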
One can expect that AWS will continue to garner favor with HPC centers as well as commercial HPC sites looking to offload at least some of their peak-time compute. Among current HPC users are the Jet Propulsion Laboratory, Pfizer, and a number of universities. AWS has most recently added its C4 instance type, featuring a variant of the latest Haswell processors, the largest size of which offers 36 vCPUs. It has also added “Placement Groups” for virtual clusters running latency-sensitive jobs. And over the years that AWS Marketplace has been alive, the HPC section has continued to grow and now lists 67 services and packages for the supercomputing set.
While other infrastructure providers, including Microsoft, Google, and SoftLayer/IBM, have also extended an arm to HPC end users, the marketplace approach AWS offers was the first to hook them, and AWS continues to focus on this narrow, but important, subset of compute consumers. The question we’ve been poking at off and on over the years is how important the HPC market is for AWS overall. It has never been simple to get a sense of how many applications of this nature run there, but if the best estimates are correct, it’s a relatively small subset of applications. Intersect360 Research, for instance, which focuses on the HPC market, found that cloud spending accounts for only around 3% of the budget allocations at HPC sites worldwide, with fluctuations year to year and very little upward momentum.
Currently
1) AWS instances are limited to a single 10GbE link per server – vs at least 2×40-56GbE or 1-2 56Gb IB in normal servers
2) the only available GPUs with decent FP64 speed and ECC are EXTREMELY outdated (Fermi M2050)
3) obtaining more than ~50-100 full size (.8xlarge) instances in the same AZ is neither automated nor guaranteed (JPL or Pfizer etc. may have special arrangements).
Given all that, “supercomputing” on AWS sounds more like a marketing slogan than reality.
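To put the first point in rough numbers, the short Python sketch below compares the nominal aggregate link bandwidth of the configurations mentioned and the ideal time to move a fixed amount of data at line rate. The helper names and the 100 GB payload are assumptions chosen for illustration, and the figures ignore protocol and encoding overhead, so real-world throughput would be lower in every case.

```python
def aggregate_gbps(links, gbps_per_link):
    """Nominal aggregate bandwidth of a server with several identical links."""
    return links * gbps_per_link


def transfer_seconds(data_gigabytes, link_gbps):
    """Ideal time to move data at line rate (8 bits per byte, no overhead)."""
    return data_gigabytes * 8 / link_gbps


# Configurations taken from the list above; rates are nominal figures.
configs = {
    "AWS instance, 1 x 10GbE": aggregate_gbps(1, 10),
    "server, 2 x 40GbE": aggregate_gbps(2, 40),
    "server, 2 x 56GbE": aggregate_gbps(2, 56),
    "server, 1 x 56Gb IB": aggregate_gbps(1, 56),
    "server, 2 x 56Gb IB": aggregate_gbps(2, 56),
}

if __name__ == "__main__":
    payload_gb = 100  # hypothetical payload size in gigabytes
    for name, gbps in configs.items():
        seconds = transfer_seconds(payload_gb, gbps)
        print(f"{name}: {gbps} Gb/s, {seconds:.1f} s per {payload_gb} GB")
```

On these nominal figures, a single 10GbE link has between roughly one fifth and one eleventh of the per-node bandwidth of the other configurations.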