HPC shops are used to doing math – it is what they do for a living, after all – and as they evaluate their hybrid computing and storage strategies, they will be doing a lot of math. And a lot of experimentation. And that is because no one can predict with anywhere near the kinds of precision that HPC shops tend to like just how their applications and their users are going to make use of on-premises and cloud capacity.
This may all settle out as experience builds, but right now – in the early days of practical HPC processing and data storage in the cloud – there are more hard questions than solid answers. Indeed, it may never precisely settle out, because everything is always changing. Cloud brings many new and interesting options, even if it does add complexity.
While HPC centers are understandably focused on the amount and the nature of the compute they can bring to bear, both from their own clusters and from those they rent from cloud service providers, it is data that is driving the choices people make about the cloud and helping seed the uptake of hybrid.
“To a large extent, the location of the data and its size really determines the location of the compute,” Rob Lalonde, vice president and general manager of the cloud division at Univa, tells The Next Platform. “If there is a petabyte dataset sitting on premises or in the cloud, the odds are that you are not going to move it. If a job is suddenly very high priority and the dataset is small, then you may move it off the cluster on premises and up to the cloud where a lot more compute resources are available to get it done faster – provided the job scales, of course.”
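Lalonde's rule of thumb can be sketched as a simple scheduling heuristic. The threshold and names below are illustrative assumptions for the sketch, not figures from Univa:

```python
def choose_location(dataset_gb, data_location, high_priority, job_scales):
    """Toy data-gravity heuristic: run the compute where the data lives,
    unless the dataset is small enough to move and the job is urgent."""
    MOVABLE_GB = 500  # illustrative cutoff for "cheap enough to move"
    if dataset_gb > MOVABLE_GB:
        return data_location          # petabyte-class data pins the compute
    if high_priority and job_scales:
        return "cloud"                # burst small, urgent, scalable jobs
    return data_location

# A petabyte on premises stays put; a small, hot, scalable job bursts out.
print(choose_location(1_000_000, "on-prem", False, True))  # on-prem
print(choose_location(100, "on-prem", True, True))         # cloud
```

The real decision also weighs egress fees and transfer time, but the shape of the logic – data gravity first, urgency second – is the same.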
One area where uptake of cloud computing – and thus hybrid – among HPC shops makes perfect sense is in life sciences, where researchers do not want to run clusters of their own and where their own datasets are relatively small. Here, public datasets are very large and tend to be stored on Amazon Web Services, Microsoft Azure, or Google Cloud so they can be shared across many researchers and many institutions, like a humongous floppy drive. (Well, one with a hard shell.) But there are some surprising public cloud users among the HPC set these days.
“We thought the oil and gas industry would be one of those sectors that wouldn’t move to the cloud because the datasets are so massive,” Lalonde says. “But we are seeing migration in oil and gas, and that surprises me. And the plan in those cases is to move the datasets once and just leave them up there, maybe shifting them down into a lower tier of storage when they are not using them. But they may change their thinking once they start getting these big storage bills. We’ll see.”
Hybrid is a natural fit for HPC. On the one hand, you have on-premises iron, where you have the benefit of having already sunk the capital costs. That capacity is often pretty well used, and HPC centers are pretty good at keeping their clusters busy – much more so than the typical enterprise, rivaling the best of the hyperscalers.
Several years back, server utilization in the datacenter was somewhere between 5 percent and 10 percent of aggregate CPU cycles, and it has probably gotten up into the 20 percent to 30 percent range thanks in large part to the use of virtual machines and containers, which allow bin packing of applications. The best back-end transaction processing and database servers are probably somewhere around 60 percent utilization. Google, one of the best in the world at making use of distributed systems, has told us in the past that its applications tend to run at 40 percent to 60 percent utilization and some clusters are running in the 80 percent to 90 percent range. IBM’s card-walloping, bill paying, transaction processing mainframes can run at a sustained 90 percent to 95 percent utilization in production, setting the pace for efficiency. But HPC clusters running simulations and models are no slouch out there in the field, and Lalonde says that Univa’s customers tend to be running their clusters at anywhere from 60 percent to 90 percent utilization. This is one reason why, perhaps, HPC lagged in adopting cloud. They were already getting close to full use out of their clusters.
The point is, there is usually not a lot of slack left in on-premises systems, so companies burst their peaks out to the cloud, thereby forgoing the need to upgrade the on-premises clusters. Bursting also provides access to more specialized resources that are missing on premises – GPUs or big memory, for example.
Of course, the clouds have to install the right technology to be able to absorb HPC workloads, and for a long time, the public clouds did not have what HPC applications crave: low latency and high bandwidth networking, complete with RDMA access across nodes to drive that low latency, as well as cheap and dense compute based on CPU, GPU, FPGA, or TPU compute engines. Now, the major clouds have all the pieces in place, so HPC centers can finally consider offloading their peaks. Mellanox Technologies, a supplier of switches and network interface adapters for the HPC market, itself offloads some of the processing for its chip designs to the Azure cloud – using tools from Univa, as it turns out – to speed up the end-of-quarter tapeouts of its chips. Ditto for disk drive maker Western Digital, which uses a combination of Grid Engine and Navops from Univa to offload certain aspects of disk drive design to AWS. Western Digital has succeeded in cutting its compute time from 20 days to just eight hours.
HPC centers can also use cloudy capacity for special projects that require more compute, or different kinds of compute, than they have in their own clusters. This will perhaps be the most important segment of the HPC story in the coming years. Now that all of this HPC iron is available in the cloud, programmers can test new applications or existing ones on new iron before making a commitment to buy, for example, racks of GPU-laden systems, which are very expensive even if they do offer substantial acceleration of workloads and density of compute.
Cloud capacity can cost anywhere from three to nine times more than buying and running the same iron internally at high utilization. During the experimentation phase of a project, it therefore makes little sense to go out and spend hundreds of thousands to millions of dollars buying, say, DGX-2 systems from Nvidia or their equivalents from others. Once you do have the public cloud component, there could be a lot of back and forth for HPC workloads, jumping from on premises to the cloud and back again.
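The three-to-nine-times premium implies a simple break-even calculation. The hourly figure below is a made-up assumption for the sketch; only the premium multipliers come from the article:

```python
# Back-of-the-envelope comparison using the article's 3x-9x cloud premium.
# The dollar figure is an illustrative assumption, not a quoted price.
on_prem_hourly = 10.0  # amortized cost per node-hour of owned, busy iron

for premium in (3, 9):
    cloud_hourly = on_prem_hourly * premium
    # Node-hours at which cloud spend equals a full year of amortized
    # on-prem ownership: below this, renting for the experiment wins.
    breakeven_hours = on_prem_hourly * 24 * 365 / cloud_hourly
    print(f"{premium}x premium: cloud matches a year of ownership "
          f"after {breakeven_hours:.0f} node-hours")
```

In other words, at a 9x premium an experiment has to burn roughly a thousand node-hours a year before owning the iron starts to pay; short evaluation runs stay firmly on the rental side of that line.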
“We have seen some repatriation of certain types of HPC workloads,” Lalonde says. “An organization will put an HPC application in the cloud and discover it just doesn’t make sense because it is costing too much money. So they repatriate it back to an on premises cluster and then move other types of applications out to the cloud because they do make more sense out there. As customers learn what makes economic and technical sense, in terms of compute and data storage, things will be shifting back and forth. It is not just a one-way migration.”
One reason the cost of HPC in the cloud is so high is waste, which adds an unnecessary 30 percent to 40 percent to the bill. A key to success with HPC in the cloud is effectively managing the rented compute and storage capacity, and one of the simplest things many organizations fail to do is turn off what they are paying for when they are finished. This may seem ridiculous, but it requires a shift in thinking from an on-premises world where the cluster is on all the time and a job scheduler just keeps feeding it work. In the cloud, people need to use job schedulers to automate the firing up and the shutting down of compute, and to move their data from block storage to object storage or tape storage – or simply delete it – when it is no longer needed.
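The teardown discipline described above can be sketched as a drain hook a scheduler runs once the queue empties. `FakeCloud` is a stand-in for a real provider SDK (boto3 on AWS, for instance); the class, method names, and grace period are all illustrative assumptions:

```python
IDLE_MINUTES = 15  # illustrative grace period before teardown

class FakeCloud:
    """In-memory stand-in for a cloud provider client (illustrative)."""
    def __init__(self, idle_by_node):
        self.idle_by_node = idle_by_node   # node name -> minutes idle
        self.stopped = []
    def list_nodes(self):
        return list(self.idle_by_node)
    def idle_minutes(self, node):
        return self.idle_by_node[node]
    def stop(self, node):
        self.stopped.append(node)          # stop paying for idle compute

def drain_hook(cloud):
    """Shut down any rented node that has sat idle past the grace period."""
    for node in cloud.list_nodes():
        if cloud.idle_minutes(node) >= IDLE_MINUTES:
            cloud.stop(node)
    return cloud.stopped

cloud = FakeCloud({"node-a": 45, "node-b": 3})
print(drain_hook(cloud))  # only the long-idle node is stopped: ['node-a']
```

A production version would also demote aging block volumes to object or archive storage in the same pass, which is where the other big chunk of waste hides.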
Another way to watch costs is to employ a mix of spot, reserved, and on-demand instances in the cloud to augment the on-premises capacity. Finding the right balance in this mix takes some experience and the use of automation. Equally important is avoiding the natural tendency to overprovision machines, something that is all the easier in the cloud because it is just a drop-down menu. A lot of HPC centers moving applications to the cloud, like other users of rented capacity, will grab a 24-core instance when one with eight cores that costs a lot less will do the trick. Test the code out, test how much memory is needed, and test whether local SSD storage is actually better than block or object storage coming in over the network on one of the public clouds. Can you get by with an NFS storage service, or do you really have to set up a parallel file system on the cloud infrastructure?
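The savings from blending purchase options is easy to see with a little arithmetic. The per-core-hour prices below are made-up assumptions for the sketch, not any provider's rate card:

```python
# Illustrative per-core-hour prices for the three purchase options.
PRICE = {"on_demand": 0.048, "reserved": 0.030, "spot": 0.015}

def blended_rate(mix):
    """mix maps purchase option -> fraction of core-hours; must sum to 1."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(PRICE[option] * frac for option, frac in mix.items())

all_on_demand = blended_rate({"on_demand": 1.0, "reserved": 0.0, "spot": 0.0})
mixed = blended_rate({"on_demand": 0.2, "reserved": 0.5, "spot": 0.3})
print(f"all on-demand: ${all_on_demand:.3f}/core-hr, "
      f"blended: ${mixed:.3f}/core-hr")
# Right-sizing compounds the saving: an 8-core instance at the blended
# rate costs a third of an overprovisioned 24-core one for the same job.
```

Even this toy mix cuts the rate by roughly 40 percent before right-sizing is considered, which is why the balance is worth automating rather than eyeballing.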
Hybrid has big benefits for HPC shops. It lets you try different scenarios and exceed the limits of your own capacity and resources. Successful, low-cost hybrid, however, means HPC shops must find the right balance between employing their own iron and that of the service provider – and managing it properly.